1 Introduction

Image super-resolution has been a research focus in computer vision and has received much attention for many years. Given a low-resolution (LR) image, super-resolution (SR) techniques aim to recover the corresponding high-resolution (HR) image [1]. Since one LR image may correspond to multiple HR images, super-resolution is an ill-posed inverse problem [2][3]. Therefore, how to efficiently restore the texture details lost during reconstruction, maintain the integrity of the image structure, and effectively suppress distortion is a challenging problem [4][5]. Super-resolution imaging is a hot topic in computer vision, and deep learning dominates current research on single-image super-resolution (SISR) methods [6].

Deep learning-based SISR methods directly learn an end-to-end mapping between LR and HR images. Dong et al. [7] proposed a SISR method based on a convolutional neural network [8], called the super-resolution convolutional neural network (SRCNN), which uses three convolutional layers to learn the nonlinear LR-to-HR mapping in an end-to-end manner. Kim et al. [9] proposed a deeper network named very deep super-resolution (VDSR) based on residual learning [10], which effectively improved performance. To increase network depth while limiting the growth of network parameters, Kim et al. [11] adopted a recursive structure with shared parameters and proposed the deeply recursive convolutional network (DRCN). Tai et al. [12] proposed the deep recursive residual network (DRRN), which simultaneously utilizes local residual, global residual, and recursive structures; its residual units are shared and only a small number of parameters is added, improving on VDSR and DRCN. The super-resolution feedback network (SRFBN-S) [13] uses a recurrent structure to share hidden-layer parameters, reducing the number of parameters while improving reconstruction quality. Hui et al. [14] proposed the information multi-distillation network (IMDN), which gradually extracts feature information within the residual block and uses a channel attention mechanism for feature selection, further improving the quality of the reconstructed image. Ahn et al. [15] proposed the cascading residual network (CARN), which combines a cascaded structure with residual learning and achieves a better balance between the number of parameters and performance. Zhu et al. [16] proposed the compact back-projection network (CBPN), which enhances reconstruction by cascading up-/down-sampling layers to extract feature information in both the LR and HR spaces. The multi-scale residual network (MSRN) [17] uses convolutional layers with different receptive fields within the residual block to extract feature information at different scales. Lai et al. [18] proposed the Laplacian pyramid super-resolution network (LapSRN), which progressively upsamples and predicts residuals, allowing HR images of multiple sizes to be reconstructed simultaneously.

Most existing algorithms improve network performance by deepening or widening the network. However, a large network model brings two problems: (1) it consumes too much storage space, which hinders deployment in practical applications; (2) it introduces a heavy computational burden, making it unsuitable for applications with limited computing power or strict real-time requirements. Therefore, the trade-off between model size and SR reconstruction performance must be considered carefully when designing the network.

LapSRN increases image resolution progressively, which generates large feature maps in the middle of the network and significantly increases computation. MFRN introduces a recursive learning strategy; although a large receptive field can be obtained with few parameters, the computational burden is high. CARN fuses information from multiple levels through a cascade mechanism, but its dense connections introduce additional parameters and computation. Its mobile variant, CARN-M, uses group convolution to reduce the number of parameters, but group convolution significantly reduces performance. IDN uses an information distillation mechanism that passes some features through skip connections to reduce the number of parameters, but it cannot effectively select the important features that need further refinement, leaving room for improvement in model performance.

The above methods use lightweight networks; however, network depth and the number of parameters are important factors that affect SISR performance. Lim et al. [19] proposed the heavyweight EDSR method, which removes the normalization modules, stacks residual blocks, and contains more than 65 convolutional layers [20]. The MM-RealSR method [38] combines residual and dense structures, making full use of the hierarchical feature information of LR images to recover high-quality HR images. Liu et al. [22] proposed RFANet, which uses a spatial attention module with a larger receptive field and fewer parameters inside the residual block to filter feature information, and then fuses the features extracted by the residual branches of the residual blocks to improve reconstruction quality. EDSR, MM-RealSR, and RFANet are currently the most representative heavyweight SISR methods; they perform well but have many parameters. With limited resources, heavyweight SISR models can hardly meet application requirements. This paper therefore considers a lightweight SISR model as an effective solution.

This paper proposes a lightweight multi-level feature fusion network for SISR. At a magnification factor of 4, the multi-level features fusion network (MFFN) has only 1.47 M parameters, which is 1/29 of EDSR, 1/14 of MM-RealSR, and 1/7 of RFANet. Compared with similar lightweight SISR models, our method achieves a better balance between performance and model size. Taking MSRN as an example, the proposed method uses 3/4 fewer parameters. On the test datasets, the objective performance at ×2, ×3, and ×4 magnification is comparable to MSRN, while at ×8 it is consistently better. For ×4 and ×8 upscaling, the subjective performance is also consistently better than MSRN. The experimental results show that the proposed method reconstructs stripe textures significantly better than other lightweight methods, and its advantage is more pronounced at the large ×8 magnification factor.

The contributions of this paper include: (1) We propose the dual residual block (DRB) with an asymmetric structure, in which the feature channels are first expanded twice and then compressed twice, and two levels of residual connections are used to effectively extract feature information. (2) We propose an autocorrelation weight unit (ACW), which computes weights from the feature information itself and adaptively weights different feature channels to effectively transfer feature information. (3) We design the shallow feature mapping unit (SFMU), which extracts different levels of shallow features through convolution layers with different receptive fields on each branch. (4) We design the multi-path reconstruction unit (MPRU), which gathers the feature information of multiple branches so that feature information at different levels can be fully utilized to reconstruct different aspects of the image.

2 Proposed algorithm

Currently, most super-resolution models utilize residual networks [19, 23], and the residual block generally adopts a Conv-ReLU-Conv structure. The problem is that model performance depends heavily on network size, which is mainly determined by the number of trainable layers and channels. Reducing network size while maintaining or improving performance is therefore very challenging. In this paper, a lightweight multi-level feature fusion network is designed. The feature channels are first expanded and then compressed within two-layer nested residual blocks, which significantly reduces the number of parameters, and the autocorrelation weight unit adaptively fuses feature information, improving feature utilization. The network structure is shown in Fig. 1a and mainly includes four parts: the shallow feature extraction unit (SFEU), the shallow feature mapping unit (SFMU), the deep feature mapping unit (DFMU), and the multi-path reconstruction unit (MPRU).

Fig. 1

a Architecture of multi-level features fusion network. b Structure of the residual group. c Symbol description

Let \(I_{LR}\) and \(I_{SR}\) be the input and output images. The shallow feature extraction unit contains only a \(3 \times 3\) convolution layer, which extracts shallow feature information and transforms the feature dimension.

$$ F_{0} { = }H_{SFEU} \left( {I_{LR} } \right) $$
(1)

\(H_{SFEU}\) is the shallow feature extraction unit, which generates the shallow feature information \(F_{0}\) that meets the dimension requirements of the shallow/deep feature mapping unit from the input image \(I_{LR}\). The shallow feature mapping unit further extracts shallow feature information from \(F_{0}\) and passes the shallow feature information to the multi-path reconstruction unit.

$$ \left( {F_{{S_{1} }} ,F_{{S_{2} }} ,F_{{S_{3} }} } \right) = H_{SFMU} \left( {F_{0} } \right) $$
(2)

\(\left( {F_{{S_{1} }} ,F_{{S_{2} }} ,F_{{S_{3} }} } \right)\) is the shallow feature information extracted by the shallow feature mapping unit \(H_{SFMU}\). The deep feature mapping unit also extracts deep feature information from \(F_{0}\).

$$ \left( {F_{{D_{1} }} ,F_{{D_{2} }} ,F_{{D_{3} }} } \right) = H_{DFMU} \left( {F_{0} } \right) $$
(3)

\(\left( {F_{{D_{1} }} ,F_{{D_{2} }} ,F_{{D_{3} }} } \right)\) is the deep layer feature information generated by the deep feature mapping unit \(H_{DFMU}\), and then the multi-path reconstruction unit receives the shallow layer feature information and the deep layer feature information to reconstruct the final result \(I_{SR}\).

$$ I_{SR} { = }H_{MPRU} \left( {F_{{S_{1} }} ,F_{{S_{2} }} ,F_{{S_{3} }} ,F_{{D_{1} }} ,F_{{D_{2} }} ,F_{{D_{3} }} } \right) $$
(4)

\(H_{MPRU}\) is the multi-path reconstruction unit, which reconstructs the image with all the feature information to generate the final result \(I_{SR}\).
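To make the data flow of Eqs. (1)–(4) concrete, a minimal PyTorch sketch is given below. The four units are passed in as submodules and are only placeholders here (their structures are detailed in the following subsections), so any wiring beyond Eqs. (1)–(4) is an assumption.

```python
import torch.nn as nn

class MFFN(nn.Module):
    """High-level sketch of Eqs. (1)-(4); the four units are passed in
    as submodules and are sketched in the following subsections."""
    def __init__(self, sfeu, sfmu, dfmu, mpru):
        super().__init__()
        self.sfeu, self.sfmu, self.dfmu, self.mpru = sfeu, sfmu, dfmu, mpru

    def forward(self, i_lr):
        f0 = self.sfeu(i_lr)     # Eq. (1): shallow feature extraction
        fs = self.sfmu(f0)       # Eq. (2): three shallow feature maps
        fd = self.dfmu(f0)       # Eq. (3): three deep feature maps
        return self.mpru(fs, fd) # Eq. (4): multi-path reconstruction
```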

A. Shallow Features Mapping Unit (SFMU)

Most SISR models use a single 3 × 3 convolution to extract shallow features. The shallow features mapping unit (SFMU) instead uses convolution kernels with different receptive fields to extract shallow feature information hierarchically, yielding multi-scale and multi-level shallow features. This rich shallow feature information helps the reconstruction module produce higher-quality SR images.

The SFMU first transforms the input through one 1 × 1 convolution to reduce the number of parameters in subsequent operations. Then, three convolutional layers with different receptive fields extract multi-scale shallow feature information. The three convolutional layers form three branches, and the shallow feature information is progressively weighted and accumulated across them, realizing the extraction of multi-scale and multi-level shallow features.

The three branches of the shallow feature mapping unit extract different levels of shallow feature information \(\left( {F_{{S_{1} }} ,F_{{S_{2} }} ,F_{{S_{3} }} } \right)\), respectively.

$$ F_{{S_{0} }} { = }H_{T} \left( {F_{0} } \right) $$
(5)
$$ F_{{S_{1} }} { = }H_{{C_{1} }} \left( {F_{{S_{0} }} } \right) $$
(6)
$$ F_{{S_{2} }} { = }H_{{C_{2} }} \left( {F_{{S_{0} }} { + }\alpha_{1} F_{{S_{1} }} } \right) $$
(7)
$$ F_{{S_{3} }} { = }H_{{C_{3} }} \left( {F_{{S_{0} }} { + }\alpha_{2} F_{{S_{2} }} } \right) $$
(8)

\(H_{T}\) is the 1 × 1 convolutional layer used to adjust the number of feature information channels, \(H_{{C_{1} }}\), \(H_{{C_{2} }}\), \(H_{{C_{3} }}\) are the convolutional layers on each branch, and \(\alpha_{1}\), \(\alpha_{2}\) are the adaptive weights that can be learned.
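A minimal PyTorch sketch of Eqs. (5)–(8) follows. The 1 × 1, 3 × 3, and 5 × 5 branch kernels match the configuration chosen in the SFMU analysis of Sect. 3; the channel count of 64 and the initialization of \(\alpha_{1}\) and \(\alpha_{2}\) are assumptions.

```python
import torch
import torch.nn as nn

class SFMU(nn.Module):
    """Sketch of the shallow feature mapping unit (Eqs. 5-8)."""
    def __init__(self, channels=64):
        super().__init__()
        self.transform = nn.Conv2d(channels, channels, 1)           # H_T, 1x1
        self.branch1 = nn.Conv2d(channels, channels, 1)             # H_C1, 1x1
        self.branch2 = nn.Conv2d(channels, channels, 3, padding=1)  # H_C2, 3x3
        self.branch3 = nn.Conv2d(channels, channels, 5, padding=2)  # H_C3, 5x5
        # learnable adaptive weights alpha_1, alpha_2
        self.alpha1 = nn.Parameter(torch.ones(1))
        self.alpha2 = nn.Parameter(torch.ones(1))

    def forward(self, f0):
        fs0 = self.transform(f0)                     # Eq. (5)
        fs1 = self.branch1(fs0)                      # Eq. (6)
        fs2 = self.branch2(fs0 + self.alpha1 * fs1)  # Eq. (7)
        fs3 = self.branch3(fs0 + self.alpha2 * fs2)  # Eq. (8)
        return fs1, fs2, fs3
```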

B. Deep Features Mapping Unit (DFMU)

To obtain deep feature information, this paper designs the DFMU module shown in Fig. 1a. The deep features mapping unit (DFMU) contains three residual groups (RG), and each RG contains multiple DRB modules. Because simply stacking residual blocks is not conducive to the transfer of feature information, a local skip connection is added in each RG to promote effective feature transfer, and deep-level feature information is obtained through the RGs.

$$ F_{{D_{i} }} { = }H_{{RG_{i} }} \left( {F_{{D_{i - 1} }} } \right) $$
(9)

The \(i^{th}\) residual group \(H_{{RG_{i} }}\) takes the hierarchical feature information \(F_{{D_{i - 1} }}\) generated by the \(\left( {i - 1} \right)^{th}\) residual group as input and generates hierarchical feature information \(F_{{D_{i} }}\).
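As a sketch, one residual group can be written as a stack of DRB modules (sketched in Sec. 2.1 below) wrapped by a local skip connection; six blocks per group follows the residual group analysis in Sect. 3, and the channel width is an assumption.

```python
import torch.nn as nn

class ResidualGroup(nn.Module):
    """One RG of the DFMU (Eq. 9): several DRBs plus a local skip connection.
    DRB is sketched in Sec. 2.1; six blocks per group follows Table 1."""
    def __init__(self, channels=32, num_blocks=6):
        super().__init__()
        self.blocks = nn.Sequential(*[DRB(channels) for _ in range(num_blocks)])

    def forward(self, x):
        # local skip connection promotes feature transfer across the group
        return self.blocks(x) + x
```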

2.1 Dual residual block (DRB)

The commonly used residual structure is shown in Fig. 2a, where each convolutional layer has the same number of channels. A major problem with this structure is that increasing the number of feature channels leads to a rapid increase in the number of parameters. This paper proposes the DRB module shown in Fig. 2b, which consists of an inner unit (IU) and an external unit (EU) and adopts an expansion-then-compression strategy [24]. This strategy keeps the base number of channels small and thus reduces the number of parameters. The first convolutional layer of the EU extracts feature information and expands the feature channels to obtain richer image features. The second convolutional layer compresses the feature channels, filters the feature information, and promotes effective feature transfer. An IU containing two 1 × 1 convolutions is added inside the EU to increase the number of channels without causing a sharp increase in the number of parameters.

Fig. 2

Structure of the different residual blocks

Assume that the input and output feature information of the \(i^{th}\) DRB module are \(F_{i - 1}\) and \(F_{i}\), respectively.

$$ F_{i} = F_{ACW} \left( {W_{{EU_{2} }} \sigma \left( {H_{IU} \left( {W_{{EU_{1} }} F_{i - 1} } \right)} \right)} \right) + F_{i - 1} $$
(10)

The kernel size of the two convolutional layers in the EU is 3 × 3, with weights \(W_{{EU_{1} }}\) and \(W_{{EU_{2} }}\), respectively (ignoring the bias terms). \(H_{IU}\) is the inner unit, \(\sigma \left( \cdot \right)\) is the ReLU activation function, and \(F_{ACW}\) is the autocorrelation weight unit. The first convolutional layer, with weight \(W_{{EU_{1} }}\), processes \(F_{i - 1}\) to generate the feature information \(F_{input}\), which is passed to the inner unit; the inner unit produces \(F_{output}\). After the second convolutional layer, the autocorrelation weight unit, and the skip connection, the output of the EU is the feature information \(F_{i}\) extracted by the \(i^{th}\) DRB module.

IU adopts the structure of Conv-ReLU-Conv and adds skip connections.

$$ F_{output} = F_{ACW} \left( {W_{{IU_{2} }} \sigma \left( {W_{{IU_{1} }} F_{input} } \right)} \right) + F_{input} $$
(11)

Here, \(W_{{IU_{1} }}\) and \(W_{{IU_{2} }}\) are the weights of the two convolutional layers of the inner unit (ignoring the bias terms), and \(\sigma \left( \cdot \right)\) is the ReLU activation function. The input of the inner unit is the output \(F_{input}\) of the first convolutional layer of the external unit; after the convolutional layers, the autocorrelation weight unit, and the skip connection, the extracted feature information is \(F_{output}\).
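Putting Eqs. (10) and (11) together, a hedged PyTorch sketch of the DRB is given below. The channel numbers (EU: 32→64→32, IU: 64→128→64) follow the DRB analysis in Sect. 3; the ACW is included in minimal form so the block is self-contained (it is described fully in Sec. 2.2).

```python
import torch
import torch.nn as nn

class ACW(nn.Module):
    """Minimal autocorrelation weight unit (see Sec. 2.2):
    global average pooling + sigmoid, no trainable parameters."""
    def forward(self, x):
        w = torch.sigmoid(x.mean(dim=(2, 3), keepdim=True))  # Eqs. (12)-(13)
        return x * w                                          # Eq. (14)

class InnerUnit(nn.Module):
    """IU (Eq. 11): 1x1 Conv-ReLU-1x1 Conv with ACW and a skip connection."""
    def __init__(self, channels=64, expansion=2):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels * expansion, 1)  # expand
        self.conv2 = nn.Conv2d(channels * expansion, channels, 1)  # compress
        self.acw = ACW()

    def forward(self, x):
        return self.acw(self.conv2(torch.relu(self.conv1(x)))) + x

class DRB(nn.Module):
    """Dual residual block (Eq. 10): the EU expands the channels with a
    3x3 conv, runs the IU, then compresses back with a second 3x3 conv."""
    def __init__(self, channels=32, expansion=2):
        super().__init__()
        mid = channels * expansion
        self.conv1 = nn.Conv2d(channels, mid, 3, padding=1)  # expand
        self.inner = InnerUnit(mid)
        self.conv2 = nn.Conv2d(mid, channels, 3, padding=1)  # compress
        self.acw = ACW()

    def forward(self, x):
        out = torch.relu(self.inner(self.conv1(x)))
        return self.acw(self.conv2(out)) + x
```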

2.2 Autocorrelation weight unit (ACW)

SISR models based on deep residual structures still suffer from vanishing or exploding gradients. To stabilize training, a residual scale parameter is usually introduced [25]. This hyperparameter is typically set empirically and is difficult to optimize. The ACW module instead learns the optimal residual scale adaptively.

The structure of the ACW module is shown in Fig. 3. It consists of two parts, a global pooling layer and a sigmoid function, and introduces no additional parameters. The global pooling layer encodes the input feature information as initial weights, which are then mapped to \(\left[ {0,1} \right]\) by the sigmoid function. Because the feature channels differ, different weights are generated, enhancing the feature information that is effective for reconstructing the image.

Fig. 3

Structure of autocorrelation weight unit (ACW)

Let \(X{ = }\left[ {x_{1} ,x_{2} ,...,x_{C} } \right]\) be the input feature information with size \(H \times W \times C\). The initial weights \({\text{Z = }}\left[ {z_{1} ,z_{2} ,...,z_{C} } \right]\) are computed from \(X\) by the global average pooling layer \(H_{GAP}\). The initial weight of the \(c^{th}\) input feature channel is as follows:

$$ z_{c} = H_{GAP} \left( {x_{c} } \right) = \frac{1}{H \times W}\sum\limits_{i = 1}^{H} {\sum\limits_{j = 1}^{W} {x_{c} \left( {i,j} \right)} } $$
(12)

The sigmoid activation function \(f\left( \cdot \right)\) maps the initial weights \(Z\) to the final weight parameters \(W\).

$$ W = f\left( Z \right) $$
(13)

Weighting the input feature information:

$$ \hat{X} = X \cdot W $$
(14)
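A short usage sketch of the ACW class from the Sec. 2.1 listing follows; the tensor shape is hypothetical. Since the unit has no trainable parameters, it can be dropped into any residual branch.

```python
import torch

acw = ACW()                     # class defined in the Sec. 2.1 sketch
x = torch.randn(1, 64, 48, 48)  # hypothetical 64-channel feature map
weights = torch.sigmoid(x.mean(dim=(2, 3), keepdim=True))  # Eqs. (12)-(13)
assert torch.allclose(acw(x), x * weights)                 # Eq. (14)
```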
C. Multi-Path Reconstruction Unit (MPRU)

Most current SISR networks use transposed convolution or sub-pixel convolution for upsampling at the end of the network. Compared with transposed convolution, sub-pixel convolution reconstructs images of better quality [26], but it requires multiple 3 × 3 convolution layers [17, 19, 21], so the number of parameters grows as the magnification factor increases. To reduce the number of parameters without reducing image quality, the multi-path reconstruction unit (MPRU) is designed, as shown in Fig. 1a. The MPRU has three reconstruction branches, each consisting of a 1 × 1 convolutional layer and a sub-pixel convolutional layer. The reconstruction result of each branch has the same size as the HR image, and the final SR image is the sum of the three branch results. Because the MPRU uses 1 × 1 convolutional layers, it greatly reduces the parameters and allows the magnification factor to increase without a significant increase in parameters. At the same time, the MPRU receives the feature information of each branch, which further improves the reconstruction quality.

\(F_{S} { = }\left[ {F_{{S_{1} }} ,F_{{S_{2} }} ,F_{{S_{3} }} } \right]\) is the shallow feature information extracted by the shallow feature mapping unit, \(F_{D} { = }\left[ {F_{{D_{1} }} ,F_{{D_{2} }} ,F_{{D_{3} }} } \right]\) is the hierarchical feature information extracted by the deep feature mapping unit, and the \(i^{th}\) branch in the multi-path reconstruction unit is generated as follows:

$$ I_{i} = H_{{UP_{i} }} \left( {H_{{Conv_{i} }} \left( {\left[ {\gamma_{i} F_{{S_{i} }} ,\beta_{i} F_{{D_{i} }} } \right]} \right)} \right) $$
(15)

\(H_{{UP_{i} }}\) and \(H_{{Conv_{i} }}\) represent the sub-pixel convolutional layer and the \(1 \times 1\) convolutional layer of the \(i^{th}\) branch, respectively, and \(\gamma_{i}\) and \(\beta_{i}\) are the adaptive weights of \(F_{{S_{i} }}\) and \(F_{{D_{i} }}\), respectively. \(I_{i}\) is the reconstructed image of the \(i^{th}\) branch, and \(\left[ \cdot \right]\) denotes concatenation of \(F_{{S_{i} }}\) and \(F_{{D_{i} }}\). The MPRU has three branches, so \(i \in \left\{ {1,2,3} \right\}\).

$$ I_{SR} { = }I_{1} { + }I_{2} { + }I_{3} $$
(16)

The multi-path reconstruction unit adds the images \(\left( {I_{1} ,I_{2} ,I_{3} } \right)\) generated by all the branches to generate the final result \(I_{SR}\).
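A hedged PyTorch sketch of Eqs. (15)–(16) is given below. Each branch concatenates its weighted shallow and deep features, applies a 1 × 1 convolution, and upsamples with a sub-pixel (PixelShuffle) layer; here the 1 × 1 convolution directly produces the shuffled channels, and all channel counts are assumptions.

```python
import torch
import torch.nn as nn

class MPRU(nn.Module):
    """Sketch of the multi-path reconstruction unit (Eqs. 15-16)."""
    def __init__(self, channels=64, scale=4, out_channels=3):
        super().__init__()
        self.gammas = nn.Parameter(torch.ones(3))  # gamma_i in Eq. (15)
        self.betas = nn.Parameter(torch.ones(3))   # beta_i in Eq. (15)
        self.convs = nn.ModuleList(
            [nn.Conv2d(2 * channels, out_channels * scale ** 2, 1)
             for _ in range(3)]
        )
        self.shuffle = nn.PixelShuffle(scale)      # sub-pixel upsampling

    def forward(self, fs, fd):
        # fs, fd: sequences of three shallow / deep feature maps
        out = 0
        for i in range(3):
            branch = torch.cat(
                [self.gammas[i] * fs[i], self.betas[i] * fd[i]], dim=1)
            out = out + self.shuffle(self.convs[i](branch))  # Eqs. (15)-(16)
        return out
```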

D. Loss Function

In this paper, the \(L_{1}\) loss function is used to optimize the MFFN. The \(L_{1}\) loss computes the mean absolute difference between corresponding pixels of the reconstructed image and the target image, and it is widely used in image super-resolution.

For a given training dataset \(\left\{ {I_{LR}^{i} ,I_{HR}^{i} } \right\}_{i = 1}^{N}\) containing \(N\) low- and high-resolution image pairs, the network is trained by minimizing the \(L_{1}\) loss, shown as follows:

$$ L\left( \theta \right){ = }\frac{1}{N}\sum\limits_{i = 1}^{N} {\left\| {H_{MFFN} \left( {I_{LR}^{i} } \right) - I_{HR}^{i} } \right\|_{1} }, $$
(17)

wherein \(H_{MFFN} \left( \cdot \right)\) denotes the network reconstruction result, \(\left\| \cdot \right\|_{1}\) denotes the \(L_{1}\) norm, and \(\theta\) denotes the parameters of the network.
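Equation (17) corresponds directly to PyTorch's built-in \(L_{1}\) loss. A minimal sketch, where `model` stands in for \(H_{MFFN}\) and is an assumption here:

```python
import torch.nn as nn

# Eq. (17): mean absolute error between reconstruction and ground truth
criterion = nn.L1Loss()

def training_loss(model, lr_batch, hr_batch):
    sr_batch = model(lr_batch)            # H_MFFN(I_LR)
    return criterion(sr_batch, hr_batch)  # averaged over pixels and batch
```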

3 Experimental analysis and results

E. Datasets and Metrics

In this paper, we choose the DIV2K [20] dataset as the training dataset, which consists of 800 training images and 100 validation images. To test the proposed model, we use five benchmark datasets, namely Set5 [30], Set14 [30], BSD100 [31], Urban100 [32], and Manga109 [33]. Among them, BSD100 contains images of diverse styles, Urban100 contains images of various types of buildings, and Manga109 contains various cartoon images.

These five test datasets contain rich and diverse content and can well verify the effectiveness of super-resolution methods. To evaluate super-resolution performance, this paper uses two commonly used full-reference image quality assessment criteria: peak signal-to-noise ratio (PSNR) [34] and the structural similarity index (SSIM) [34]. Following the super-resolution convention, both metrics are computed on the luminance channel, because human vision is more sensitive to luminance than to chrominance.
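For reference, a hedged sketch of luminance-channel PSNR using the BT.601 luma conversion common in SR benchmarks; the border crop is an assumption, and the paper's exact evaluation code is not specified.

```python
import numpy as np

def psnr_y(sr, hr, border=4):
    """PSNR on the Y channel of 8-bit RGB images (sketch)."""
    def to_y(img):
        # ITU-R BT.601 luma conversion used by most SR benchmarks
        return np.dot(img[..., :3], [65.481, 128.553, 24.966]) / 255.0 + 16.0
    y_sr = to_y(sr.astype(np.float64))
    y_hr = to_y(hr.astype(np.float64))
    if border:  # crop borders, as is common practice in SR evaluation
        y_sr = y_sr[border:-border, border:-border]
        y_hr = y_hr[border:-border, border:-border]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)
```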

F. Experimental Details

In each training round, the low-resolution RGB images and the corresponding high-resolution RGB images are cropped into 48 × 48 patches. The training data are augmented by random rotations of 90, 180, and 270 degrees and by horizontal flipping. The network contains three residual groups, each with the same number of DRB modules, set according to the residual group analysis below. Unless otherwise specified, the number of channels is 64, and the final output of the network has 3 channels.

The Adam optimizer [27] adaptively adjusts the learning rate for each parameter, which effectively improves convergence speed, so it is used in our experiments. The initial learning rate is set to \(2 \times 10^{-4}\) and is halved every \(2 \times 10^{5}\) iterations. The proposed method is implemented on hardware with an Intel i9-9900K (3.6 GHz) CPU, 8 GB of RAM, and an NVIDIA GeForce RTX 2080Ti GPU. The software environment is a 64-bit Ubuntu operating system with the PyTorch platform.
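The optimizer and schedule described above can be sketched as follows; `model`, `loader`, and `criterion` are assumed to be defined elsewhere (e.g., as in the earlier sketches).

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# halve the learning rate every 2e5 iterations (scheduler stepped per batch)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=200_000, gamma=0.5)

for lr_img, hr_img in loader:  # 48x48 LR patches with augmentation
    optimizer.zero_grad()
    loss = criterion(model(lr_img), hr_img)
    loss.backward()
    optimizer.step()
    scheduler.step()           # counts iterations, not epochs
```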

G. Experimental Results and Analysis

1) Residual group analysis

The proposed model contains three residual groups, each containing the same number of double-nested residual blocks. To verify the influence of the number of DRBs per residual group, we performed ×4 super-resolution experiments on the Set5 standard test dataset and the DIV2K-10 dataset with 5, 6, or 7 DRBs per residual group, as shown in Table 1. Compared with 5 DRBs per group, using 6 increases the number of parameters by 0.24 M but improves the PSNR by 0.03 dB on both Set5 and DIV2K-10. With 7 DRBs per group, the PSNR improvement is not obvious, while the number of parameters increases by another 0.24 M. We therefore use 6 DRBs per residual group.

2) SFMU analysis

Table 1 Average PSNR and number of parameters for different numbers of DRBs per residual group on Set5 and DIV2K-10 with scaling factor × 4 in 200 epochs

To verify the influence of the convolution kernel sizes on the different branches of the shallow feature mapping unit, and of omitting the unit entirely, we performed ×4 super-resolution comparison experiments on the Set5 standard test dataset and the DIV2K-10 dataset. As shown in Table 2, without the shallow feature mapping unit the model cannot exploit shallow feature information, resulting in poor reconstruction. When each branch uses the same kernel size and the kernel size is increased, the reconstruction improves, but the number of parameters also increases. When all branch kernels are 1 × 1, the PSNR is lower than that of the model without the shallow feature mapping unit, because too little feature information is extracted and redundant information is introduced. We find that the best results are obtained when the three branch kernels are set to 1, 3, and 5, respectively, because each branch then extracts a different level of shallow feature information that can be effectively combined with the deep-level features. Therefore, we use a shallow feature mapping unit with branch kernel sizes of 1, 3, and 5.

3) DRB analysis

Table 2 Average PSNR of different SFMU convolution kernel sizes on Set5 and DIV2K-10 with scaling factor × 4 in 200 epochs
Table 3 Influence of residual blocks with different structures on model performance. We report the average PSNR on Set5 and DIV2K-10 with scaling factor × 4 in 200 epochs

Compared with the popular residual block structure shown in Fig. 2a, the DRB module has advantages in both performance and parameter count. We conduct ×4 super-resolution comparison experiments on the Set5 and DIV2K-10 datasets with two test models, Model I and Model II. Model I uses the EDSR structure, but the number of input and output channels of each convolutional layer is reduced from 256 to 64, and the number of residual blocks from 32 to 18. Model II replaces the residual blocks in Model I with DRB modules. In the DRB expansion stage, the EU input and output feature channels are set to 32 and 64, and the IU input and output feature channels to 64 and 128, respectively; in the compression stage, the EU channels are set to 64 and 32, and the IU channels to 128 and 64, respectively. The other settings of Model I and Model II are identical: the residual scale parameter is 0.1, and training runs for 200 epochs. The PSNR of Model II is 0.01 dB and 0.05 dB higher than that of Model I on Set5 and DIV2K-10, respectively, while the DRB module has 20.3 K fewer parameters than the EDSR residual block.

4) ACW analysis

To verify the effectiveness of the ACW module, Model I is evaluated with and without the ACW module; the results are shown in Table 4. With the ACW module, the PSNR improves by 0.12 dB and 0.13 dB on the Set5 and DIV2K-10 datasets, respectively. The experimental results show that the ACW module effectively learns the optimal residual scale parameters automatically.

5) MPRU analysis

Table 4 Effects of ACW. We report the average PSNR on Set5 and DIV2K-10 with scaling factor × 4 in 200 epochs

To verify the reconstruction performance of the MPRU module, the EDSR reconstruction unit of Model I is replaced with the MPRU module and compared against Model I; the results of the two models are shown in Table 5. The MPRU module has only 99.36 K parameters, about 1/3 of the parameter size of the EDSR reconstruction unit. On the Set5 and DIV2K-10 datasets, the MPRU module improves the PSNR by 0.12 dB and 0.15 dB, respectively.

H. Model analysis

Table 5 Effects of MPRU. We report the average PSNR on Set5 and DIV2K-10 with scaling factor × 4 in 200 epochs

The proposed model is evaluated at ×2, ×3, ×4, and ×8 super-resolution factors on the Set14, B100, Urban100, and Manga109 datasets. We choose SRCNN [7], FSRCNN [35], VDSR [9], DRCN [11], LapSRN [18], DRRN [12], CBPN [16], IMDN [14], BFDN [42], BSRN [41], Swin-IR [39], MM-RealSR [38], and LDL [36] for performance comparison with MFFN.

The objective results are shown in Table 6. At the ×2 factor on the Set14 dataset, the PSNR of our model is 0.27 dB higher than that of the BFDN model, with similar gains at the other magnification factors, while our model has about 120 K fewer parameters than BFDN. Averaged over all test datasets at the ×4 factor, the PSNR of our model is 0.06 dB and 0.16 dB higher than that of the MM-RealSR and LDL models, respectively. Although BSRN performs slightly better on some datasets, the comprehensive performance of our model is better, and at ×4 the number of parameters of the BSRN model is four times that of our model.

Table 6 Average PSNR/SSIM of various SISR methods

To compare the reconstruction performance of different super-resolution methods in terms of visual quality, Figs. 4 and 5 show the super-resolution reconstruction results of the "Img048" and "Img092" images from Urban100 at the ×4 factor, and Figs. 6 and 7 show the reconstructions of "223061" and "253027" from the B100 dataset at the ×4 factor. The ground truth is the original HR image. To highlight the contrast, a local area of each image is selected and magnified using bicubic interpolation. Observing Figs. 5 and 7, although MM-RealSR [38] can clearly recover the salient texture in the image, the texture has obvious orientation errors, while Swin-IR [39] and BSRN [41] recover correct texture to some extent but have difficulty suppressing wrong textures, and their textures are more blurred.

Fig. 4

Super-resolution results of “Img048” in Urban100 dataset for × 4 factor

Fig. 5

Super-resolution results of “Img092” in Urban100 dataset for × 4 factor

Fig. 6

Super-resolution results of "223061" in BSD100 dataset for × 4 factor

Fig. 7

Super-resolution results of "253027" in BSD100 dataset for × 4 factor

In contrast, the method in this paper produces correctly oriented textures and sharper edges in the locally zoomed-in regions, and its results are more consistent with human vision. This is due to the multi-scale and multi-level feature extraction of the SFMU module, which enables the network to recover the complex texture structures of different images more faithfully. In most of the local zooms, the details reconstructed by other methods are blurred and even the edge information cannot be recovered, while the details reconstructed by our method are much clearer and easier to recognize. The results in Fig. 6 likewise indicate that the proposed method achieves better subjective performance.

I. Ablation experiments

To verify the effectiveness of SFMU and DFMU, ablation experiments are conducted on the test dataset Set5 at ×4 image magnification.

The convergence curves of the five networks are given in Fig. 8. We choose a baseline with 18 residual blocks, and the five networks have the same number of residual blocks. Adding the SFMU module and the DFMU module to the baseline separately yields the curves Baseline + SFMU and Baseline + DFMU, verifying that each module effectively improves the performance of the baseline. Removing the DRB from the DFMU module yields the curve Baseline + DFMU_no_DRB. Compared with Baseline + DFMU, the network converges faster after losing the DRB, but the final PSNR decreases by 0.03 dB; it is still 0.04 dB higher than the baseline, which again verifies the effectiveness of the DFMU design. Adding both modules to the baseline simultaneously yields the curve Baseline + SFMU + DFMU, whose performance is better than that of either module alone. These quantitative and visual analyses demonstrate the effectiveness of SFMU and DFMU.

Fig. 8

Convergence analysis on SFMU and DFMU. The curves for each combination are based on the PSNR on Set5 with × 4 factor in 800 epochs

Table 7 Results of SFMU and DFMU module with × 4 scale on Set5

Table 7 gives the experimental results when the network contains one or both of the SFMU and DFMU modules. When the network contains both modules, the PSNR improves by 0.07 dB and 0.05 dB compared with using only the SFMU or only the DFMU module, respectively, and the highest SSIM is also obtained.

To better demonstrate the effect of the MPRU module, the feature maps after shallow feature extraction and after the MPRU module are visualized, where Fig. 9a shows the network output at the first convolution layer, and Fig. 9b and c show the outputs of the DFMU module and the MPRU module, respectively. From Fig. 9b and c, it can be seen that the DFMU module learns a large number of self-similar features of the image; for example, the circular spots on the butterfly are well recovered. The MPRU module, in turn, learns more details of the image texture. The experimental results show that the two modules play a good role in feature enhancement.

J. Parameter and time analysis

Fig. 9

Results of each module in the network

To further verify the effectiveness of the proposed model, MFFN is compared with several deep learning super-resolution methods that are recognized to achieve good results, including IMDN [14], LDL [36], Swin-IR [39], and BSRN [41], in terms of the number of parameters and the computational cost; the results are shown in Table 8. The table shows that MFFN achieves better objective metrics while significantly reducing the number of parameters and the computational cost of the network. At ×2 magnification on the Set14 dataset, the number of MFFN parameters is approximately 53% of that of IMDN and Swin-IR, as is its computational cost, yet the PSNR and SSIM results are very similar. Although the number of parameters and computational cost of MFFN are slightly higher than those of BSRN, its PSNR and SSIM improve by 0.07 dB and 0.0022, respectively, over BSRN.

Table 8 Comparison of parameter size and computational cost on Set5

This demonstrates that MFFN achieves a better balance between image reconstruction quality, model compactness, and computational efficiency; that is, MFFN obtains better PSNR and SSIM results even with fewer parameters. Compared with the LDL method, which currently leads in objective metrics, MFFN achieves similar reconstruction quality with far fewer parameters.

4 Conclusion

This paper proposes a lightweight multi-level feature fusion network for reconstructing high-quality super-resolution images. A double-layer nested residual block (DRB) is designed to extract image feature information: the number of feature channels is first expanded and then compressed, and convolution layers with different receptive fields are used, reducing the number of parameters. To transmit feature information effectively within the double-nested residual block, the autocorrelation weight unit (ACW) is designed, which generates weights from the feature information and then uses them to weight the feature channels so that high-weight features are effectively transmitted. The two-layer nested residual blocks are grouped into residual groups to extract deep hierarchical features, and the shallow feature mapping unit (SFMU) is constructed to extract multi-scale and multi-level shallow feature information. The multi-path reconstruction unit (MPRU) fuses the deep and shallow feature information to reconstruct a high-quality super-resolution image. The experimental results show that these modules help reconstruct high-quality images, and the proposed model effectively enhances image stripes and reconstructs high-quality super-resolution images. Compared with other lightweight models, our model achieves a better balance between performance and model scale.