1 Introduction

Single image super-resolution (SISR) aims to reconstruct a high-resolution image from a low-resolution input and is a classic low-level restoration task in computer vision [5, 13]. Since the mapping from a low-resolution image to a high-resolution image is one-to-many rather than one-to-one, SISR is a severely ill-posed problem. Various SISR methods have been proposed and have achieved impressive results, the most notable being those based on deep learning [2, 7, 8, 20, 33, 38, 40].

With the development of convolutional neural networks, single image super-resolution has attracted significant attention from researchers, who have achieved state-of-the-art performance on various SR benchmarks. Dong et al. [7] first applied a convolutional neural network to the super-resolution task. They then proposed FSRCNN [8], which learns directly from the low-resolution input and up-samples the features at the end of the network. To improve model performance, Kim et al. proposed VDSR [15], which increases the depth of the network. Inspired by progress on image classification [9], much subsequent work applied residual learning to SR. As a result, many high-performing models carry relatively large parameter counts. To reduce model parameters, recursive network models have been applied to SR: a recursive model decomposes the complex SR problem into a series of simpler, more tractable sub-problems by sharing parameters. Many researchers have adopted recursive networks as their basic architecture, such as DRCN [16], DRRN [28], and MemNet [29]. All of these models achieve good performance with fewer parameters (Fig. 1).

Fig. 1. Trade-off between performance and number of parameters on the Urban100 ×2 dataset. The orange circle represents our proposed method; the blue circles represent the other methods

As computing performance has improved, many super-resolution networks [20, 33, 38, 40] have grown to have large parameter counts and heavy computational overhead, which makes them difficult to deploy in practice. Several approaches to designing lightweight SR networks currently exist. Recursive networks achieve good SR performance with few model parameters, but they incur a huge computational overhead. The CARN network [2] adopts local and global cascading modules, which make full use of feature information at all levels but cannot avoid information redundancy. Network architecture search [6] can exploit the capacity of each module, although search-based models have not improved SR performance much, being limited by their search spaces and search strategies. Observing that residual networks improve SR performance [20], we note that the residual block does so by integrating the newly extracted features with the original, unprocessed features. Because the original features can reach every level of the network through short or long connections, the incremental features extracted from the input overlap substantially with the original features that are reused throughout the network; this redundancy limits the network's SR performance. To make better use of feature information, we adopt a feature separation technique that splits the features into extracted and unextracted parts and then merges them, continuously updating the original feature information at different levels. We therefore propose a feature separation and fusion module, which reduces feature information redundancy and is well suited to the design of lightweight network models. Since VDSR [15] shows that increasing network depth improves SR performance, we also adopt the idea of partitioning: as shown in Fig. 2, several feature separation and fusion modules plus a convolutional layer constitute a feature residual learning module, and several feature residual learning modules plus a convolutional layer constitute the global residual learning module. Our network can thus integrate and learn features at different levels, increasing the richness of the features.

Fig. 2. Network architecture of our feature separation and fusion network (FSFN)

In this work, we propose a feature separation and fusion network for SISR. Compared to networks with similar parameter counts and multi-adds, our network shows better SR performance. The contributions of this paper can be summarized as follows:

  • We propose a feature separation and fusion module, which separates the original features into extracted and unextracted features, then processes the unextracted features in a further extraction step to increase feature diversity. Finally, a 1×1 convolution adaptively selects and fuses the features. This makes the fused features more representative, which substantially improves reconstruction quality. Experimental results show that the proposed feature separation and fusion module improves SR performance.

  • We adopt the idea of partitioning, which increases the depth of the network and allows it to better integrate and learn the features produced by other modules. This further improves the SR performance of our network.

  • Extensive experimental evaluation on several publicly available datasets shows that the proposed FSFN performs better than most existing methods.

2 Related works

Early work on image super-resolution mostly adopted interpolation techniques based on sampling theory [3, 19, 42]. With the development of deep learning, the problem is now mostly addressed with deep networks, so our review focuses on the application of deep learning to image super-resolution.

2.1 Single image super-resolution

A variety of deep learning-based techniques for the SR problem have been explored in the extensive literature on this topic [12, 14, 18, 21, 25, 26, 33,34,35, 41]. Dong et al. proposed SRCNN [7], the first successful attempt to solve image super-resolution with convolutional networks. They then introduced FSRCNN [8], which offered better SR performance. In contrast to these shallow architectures, Kim et al. proposed VDSR [15], which has a much deeper hierarchical structure. Kim et al. also proposed a deep recursive convolutional network, DRCN [16], whose repeatable convolution unit allows the model to maintain good SR performance with few parameters. Inspired by the success of DenseNet [10] in image classification, Tong et al. proposed SRDenseNet [31], which achieves higher flexibility and richer feature representations through densely connected CNN layers. Zhang et al. proposed RDN [40] by introducing local and global residual connections, and then RCAN [38], whose channel attention mechanism in each local residual block improved SR performance significantly. Ahn et al. proposed CARN [2], which lets the feature information of each residual block flow between different levels through many short connections; this model performs well among lightweight networks. Liu et al. proposed the residual feature aggregation network (RFANet) [22], consisting of an RFA framework and a powerful ESA block: the RFA framework groups several residual modules together and forwards the features on each local residual branch directly through added skip connections, which also effectively improves SR performance. To handle the uncertainty of image degradation, Zhang et al. [37] proposed an end-to-end trainable unfolding network that leverages both learning-based and model-based methods and can handle the classical degradation model with a single model, expanding the network's data-processing scope beyond bicubic-interpolation degradation.

2.2 Residual learning

Residual learning is now widely used in computer vision tasks. It was originally proposed to mitigate vanishing gradients and make very deep networks trainable. In SR, residual learning mainly models the high-frequency information between the input and the ground truth, and the processing of this high-frequency information is an important factor in SR performance. Lim et al. proposed EDSR [20] by modifying the ResNet architecture used for image classification; removing the batch normalization layers greatly improved the model's SR performance. To improve SR performance further, we separate and fuse the high-frequency feature information within the residual block, improving the hierarchy and richness of the high-frequency features; our network accordingly shows better SR performance.
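For concreteness, the sketch below shows the EDSR-style residual block referenced here: two convolutions with a ReLU in between and no batch normalization, added back to the identity path. This is an illustrative PyTorch rendering under the usual 3×3-kernel assumption, not the authors' released code.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv with no batch norm."""
    def __init__(self, n_feats):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )

    def forward(self, x):
        # the skip connection lets the block concentrate on the
        # high-frequency residual between input and output
        return self.body(x) + x
```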

3 Proposed method

In this section, we describe our proposed feature separation and fusion network (FSFN) in detail.

3.1 Network structure

As shown in Fig. 2, our feature separation and fusion network is mainly composed of three parts: the shallow feature extraction module, the global residual learning module, and the up-sampling reconstruction module. Given an input LR image \(I_{LR}\) and its corresponding target HR image \(I_{HR}\), the super-resolution image \(I_{SR}\) can be obtained by

$$ I_{SR} = F_{FSFN}(I_{LR}) $$
(1)

where \(F_{FSFN}(\cdot)\) denotes our FSFN. Like most previous studies, our shallow feature extraction module uses only a single convolution layer, described as

$$ sf = F_{SFEM}(I_{LR}) $$
(2)

where \(F_{SFEM}(\cdot)\) denotes our shallow feature extraction module and \(sf\) represents the shallow features extracted from the input \(I_{LR}\). The shallow features \(sf\) are used as input to the global residual learning module to generate more refined features \(rf\):

$$ rf = F_{GRLM}(sf) $$
(3)

where \(F_{GRLM}(\cdot)\) denotes our proposed global residual learning module. Finally, \(rf\) is used as the input of the up-sampling reconstruction module (URM) to generate the super-resolution image \(I_{SR}\):

$$ I_{SR} = F_{URM}(rf) $$
(4)

where \(F_{URM}(\cdot)\) denotes our proposed up-sampling reconstruction module, which consists of an upscale module and a convolution layer. Following ESPCN [27], we choose sub-pixel convolution as our upscale module, which has proved to be an effective choice.
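To make the pipeline of Eqs. (1)-(4) concrete, the following PyTorch sketch wires the three stages together. The module names follow the paper, but the internal layer choices (a single 3×3 convolution for the SFEM, PixelShuffle-based sub-pixel upscaling in the URM) are our assumptions for illustration; the GRLM is left as an injected module and is detailed in Section 3.2.

```python
import torch.nn as nn

class FSFNSketch(nn.Module):
    """Illustrative three-stage pipeline of Eqs. (1)-(4); not the authors' code."""
    def __init__(self, n_feats=48, scale=2, grlm=None):
        super().__init__()
        # Shallow feature extraction module (Eq. 2): a single convolution.
        self.sfem = nn.Conv2d(3, n_feats, 3, padding=1)
        # Global residual learning module (Eq. 3), injected; see Section 3.2.
        self.grlm = grlm if grlm is not None else nn.Identity()
        # Up-sampling reconstruction module (Eq. 4): sub-pixel convolution
        # (PixelShuffle) followed by a convolution back to RGB.
        self.urm = nn.Sequential(
            nn.Conv2d(n_feats, n_feats * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(n_feats, 3, 3, padding=1),
        )

    def forward(self, i_lr):
        sf = self.sfem(i_lr)   # Eq. (2)
        rf = self.grlm(sf)     # Eq. (3)
        return self.urm(rf)    # Eq. (4)
```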

Finally, FSFN is optimized with the L1 loss function, as in other networks [2, 20, 41]. Given a training set \(\{{I^{i}_{LR},I^{i}_{HR}}\}^{N}_{i=1}\) of N LR-HR pairs, the loss function of our FSFN network can be expressed as follows:

$$ L(\theta)=\frac{1}{N}\sum\limits_{i=1}^{N}\left|\left|F_{FSFN}(I^{i}_{LR}) - I^{i}_{HR} \right| \right|_{1} $$
(5)

where \(\theta\) represents the parameters of our model updated during training and \(\left |\left |\cdot \right |\right |_{1}\) is the \(\ell_1\) norm.
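A minimal training step corresponding to Eq. (5) might look as follows; the model and optimizer are placeholders, and `F.l1_loss` averages the absolute error over the batch, matching the \(\frac{1}{N}\sum \|\cdot\|_1\) form above.

```python
import torch.nn.functional as F

def train_step(model, optimizer, i_lr, i_hr):
    """One optimization step with the L1 loss of Eq. (5)."""
    optimizer.zero_grad()
    i_sr = model(i_lr)            # F_FSFN(I_LR)
    loss = F.l1_loss(i_sr, i_hr)  # mean absolute error
    loss.backward()
    optimizer.step()
    return loss.item()
```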

3.2 Global residual learning module (GRLM)

In this section, we describe the core module of the network in detail: the global residual learning module (GRLM; see Fig. 2). The GRLM consists of N feature residual learning modules (FRLM), a convolutional layer, and a long connection. This design was inspired by EDSR [20], which lets the network concentrate on processing the high-frequency parts of the features. Similar to RCAN [38], each FRLM in turn consists of M feature separation and fusion modules (FSFM), a convolutional layer, and a short connection. This enables the shallow features (sf) extracted by the shallow feature extraction module to be exploited at deeper levels. The GRLM can be described by the following formula:

$$ \begin{aligned} rf = & F_{conv}(ff_{n}) + sf\\ ff_{j} = & F_{FRLM_{j}}(ff_{j-1})\qquad (j = 1,2,...,n)\\ ff_{0} = & F_{FRLM_{0}}(sf) \end{aligned} $$
(6)

In this formula, \(F_{conv}(\cdot)\) represents a convolutional layer, \(rf\) represents the refined features, and \(ff_{j}\) represents the output of the j-th feature residual learning module (\(F_{FRLM_{j}}\)), which we define as the fusion features. The FRLM can be described as follows:

$$ \begin{aligned} ff_{j+1} =& F_{conv}(f_{m}) + ff_{j}\\ f_{i} =& F_{FSFM_{i}}(f_{i-1})\qquad (i = 1,2,...,m)\\ f_{0} =& F_{FSFM_{0}}(ff_{j}) \end{aligned} $$
(7)

where \(f_{i}\) represents the output features of the i-th feature separation and fusion module (\(F_{FSFM_{i}}\)) and the input features of the (i+1)-th module (\(F_{FSFM_{i+1}}\)). We discuss the FSFM in more detail in the next section.
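The nesting of Eqs. (6) and (7) can be summarized in code. In the sketch below, a hypothetical `fsfm_factory` stands in for the FSFM of Section 3.3 (a plain convolution serves as a placeholder so the snippet runs on its own); everything else follows the formulas: each FRLM applies M FSFMs, a convolution, and a short skip, and the GRLM applies N FRLMs, a convolution, and a long skip.

```python
import torch.nn as nn

def placeholder_fsfm(n_feats):
    # stand-in for the FSFM of Section 3.3; any channel-preserving module works
    return nn.Conv2d(n_feats, n_feats, 3, padding=1)

class FRLM(nn.Module):
    """Feature residual learning module, Eq. (7)."""
    def __init__(self, n_feats, m, fsfm_factory=placeholder_fsfm):
        super().__init__()
        self.body = nn.Sequential(*[fsfm_factory(n_feats) for _ in range(m)])
        self.conv = nn.Conv2d(n_feats, n_feats, 3, padding=1)

    def forward(self, ff):
        return self.conv(self.body(ff)) + ff  # short connection

class GRLM(nn.Module):
    """Global residual learning module, Eq. (6)."""
    def __init__(self, n_feats, n, m, fsfm_factory=placeholder_fsfm):
        super().__init__()
        self.body = nn.Sequential(
            *[FRLM(n_feats, m, fsfm_factory) for _ in range(n)])
        self.conv = nn.Conv2d(n_feats, n_feats, 3, padding=1)

    def forward(self, sf):
        return self.conv(self.body(sf)) + sf  # long connection
```

In the full network, something like `GRLM(48, n=3, m=6, fsfm_factory=FSFM)` would instantiate the configuration described in Section 4.3.1, with the real FSFM plugged in.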

3.3 Feature separation and fusion module (FSFM)

The residual block proposed by EDSR [20] shows that focusing on the processing of high-frequency features leads to strong SR performance. As Fig. 3 shows, however, its processing of high-frequency features uses only two convolutional layers. In view of the limited feature extraction capability of a single convolution, we propose a feature separation and fusion module: we retain both the feature information extracted by the first convolution and the feature information not yet extracted, and then re-extract the unextracted features. We define the re-extraction module as the local feature extraction module (LFEM), shown in Fig. 4. It can be described as follows:

$$ \begin{aligned} euf_{i} = F_{LFEM}(uf_{i}) \end{aligned} $$
(8)

In this formula, \(uf_{i}\) represents the unextracted features and \(euf_{i}\) represents the features extracted from \(uf_{i}\). Our feature separation module (FSM) can thus be expressed as

$$ \begin{aligned} ef_{i}, euf_{i} =& F_{FSM}(f_{i})\\ =& F_{conv}(f_{i}),F_{LFEM}(f_{i} -F_{conv}(f_{i})) \end{aligned} $$
(9)

where \(f_{i}\) represents the input features of the i-th FSFM. \(euf_{i}\) and \(ef_{i}\) are then combined adaptively by the feature fusion module (FFM). Therefore, our FSFM can be expressed as

$$ \begin{aligned} f_{i+1} =& F_{FSFM}(f_{i})\\ =& F_{FFM}(F_{FSM}(f_{i})) + f_{i} \end{aligned} $$
(10)

where \(F_{FSFM}(\cdot)\) denotes our proposed feature separation and fusion module. Compared with the residual block proposed by EDSR [20], our FSFM behaves exactly like a residual block when the LFEM branch contributes nothing, since that path is then effectively disconnected; the residual block can thus be seen as a special case of the FSFM. Since our FSM uses only a 1×1 convolutional layer, the increase in model size is negligible. As shown in Figs. 5 and 6, more than half of the unextracted feature maps are indistinctive, which shows that our network can effectively reduce feature redundancy.
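A possible PyTorch reading of Eqs. (8)-(10) is sketched below. The 1×1 separation convolution and the residual connection are as stated in the text, and the FFM is rendered as concatenation followed by a 1×1 convolution, matching the adaptive fusion described in Section 1. The exact internals of the LFEM are not specified here, so the conv-ReLU-conv body (with the widened channel option used by FSFN_SP) is an assumption.

```python
import torch
import torch.nn as nn

class FSFM(nn.Module):
    """Feature separation and fusion module, Eqs. (8)-(10); illustrative sketch."""
    def __init__(self, n_feats, lfem_feats=None):
        super().__init__()
        lfem_feats = lfem_feats or n_feats           # FSFN_SP widens this to 64
        self.separate = nn.Conv2d(n_feats, n_feats, 1)  # FSM: 1x1 conv, Eq. (9)
        self.lfem = nn.Sequential(                      # LFEM, Eq. (8) (assumed body)
            nn.Conv2d(n_feats, lfem_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(lfem_feats, n_feats, 3, padding=1),
        )
        self.fuse = nn.Conv2d(2 * n_feats, n_feats, 1)  # FFM: adaptive 1x1 fusion

    def forward(self, f):
        ef = self.separate(f)    # extracted features ef_i
        euf = self.lfem(f - ef)  # re-extract the unextracted part, Eq. (9)
        fused = self.fuse(torch.cat([ef, euf], dim=1))
        return fused + f         # Eq. (10): local residual connection
```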

Fig. 3. The structure of the residual blocks

Fig. 4. Decomposition of the feature separation and fusion module (FSFM); (a) represents the feature separation module (FSM), while (b) represents the feature fusion module (FFM)

Fig. 5. Extracted features in the first feature separation and fusion module

Fig. 6. Unextracted features in the first feature separation and fusion module

4 Experiments

4.1 Datasets and metrics

The DIV2K dataset [1] is widely used for many image restoration tasks. Following [2, 20, 38], we use DIV2K, which contains 800 RGB training images, as the training dataset for FSFN. Five commonly used datasets, Set5 [4], Set14 [36], BSD100 [23], Urban100 [11], and Manga109 [24], are used to evaluate SR performance. To gauge the SR results, we apply two objective image quality assessment criteria: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [32]. All criteria are calculated on the Y channel of the transformed YCbCr space.
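For reference, a typical implementation of the Y-channel PSNR used above is sketched below. The BT.601-style conversion coefficients are the ones commonly used in SR evaluation code, and shaving a border of `scale` pixels is common practice; the paper does not spell out these details, so treat them as assumptions.

```python
import numpy as np

def rgb_to_y(img):
    """Convert an HxWx3 uint8 RGB image to the Y channel of YCbCr (BT.601)."""
    img = img.astype(np.float64)
    return 16.0 + (65.738 * img[..., 0] + 129.057 * img[..., 1]
                   + 25.064 * img[..., 2]) / 256.0

def psnr_y(sr, hr, shave=0):
    """PSNR between two RGB images, computed on the Y channel."""
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if shave:  # optionally crop a border of `shave` pixels (often the scale)
        y_sr = y_sr[shave:-shave, shave:-shave]
        y_hr = y_hr[shave:-shave, shave:-shave]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```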

4.2 Implementation details

HR image patches of size 192×192 are randomly cropped from the HR images in the DIV2K dataset as the input of our model. The LR images are obtained from the HR images by bicubic interpolation according to the scaling factor (×2, ×3, ×4). The mini-batch size is set to 16 and the filter size to f = 3. The number of filters is set to 32 for FSFN_SP and 48 for FSFN. We train our FSFN with the ADAM optimizer [17], setting \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\epsilon = 10^{-8}\). The learning rate is initialized to \(10^{-4}\) and halved every \(2 \times 10^{5}\) minibatch updates. We implement the proposed FSFN in the PyTorch framework on a Titan Xp GPU.
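The optimizer and learning-rate schedule above translate directly into PyTorch. In the sketch below the model is a stand-in placeholder, and `scheduler.step()` is assumed to be called once per minibatch update so that `step_size=200000` realizes the halving every \(2 \times 10^{5}\) updates.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for FSFN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
# halve the learning rate every 2e5 minibatch updates
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=200000, gamma=0.5)
```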

4.3 Model analysis

In this section, we analyze the model parameters and computations, the effectiveness of the FSFM, and the benefit of the block processing (BP) idea.

4.3.1 Model parameters and calculations

As mentioned in Section 3.2, our core module GRLM is composed of N FRLMs, and each FRLM is composed of M FSFMs, the smallest indivisible unit of our network. In our proposed FSFN, we set N = 3 and M = 6. In contrast to most networks [2, 20, 39], the number of feature maps in our network is 48, which greatly reduces the model's parameters. We also propose a small version of FSFN named FSFN_N2M4, with N = 2, M = 4, and 32 feature maps. To further enhance the performance of the small version, we propose FSFN_SP, which increases the number of output feature maps of the LFEM convolution in each FSFM to 64; this further improves SR performance at the cost of a small parameter increase (Table 1). Table 2 compares models with different numbers of FRLMs and FSFMs; all of the configurations we selected show good SR performance under a limited parameter budget. To further enhance the quality of the SR images, we adopt the self-ensemble strategy, summarized as follows. We flip and rotate the input image \(I_{LR}\) to generate augmented images \(I_{LR}^{n,i} = T_{i}(I_{LR}^{n})\) for each sample, where the \(T_{i}\) are the eight geometric transformations, including the identity. We then feed the augmented images to our network to generate super-resolved images \(I_{SR}^{n,1},...,I_{SR}^{n,8}\), apply the inverse transforms to recover the original geometry, \(\tilde {I}_{SR}^{n,i} = T_{i}^{-1}(I_{SR}^{n,i})\), and average the transformed outputs to obtain the self-ensemble result \(I_{SR}^{n} = \tfrac {1}{8}{\sum }_{i=1}^{8}\tilde {I}_{SR}^{n,i}\). FSFN_SP+ and FSFN+ are obtained by applying this self-ensemble strategy. As can be seen in Fig. 1, our models perform best against other state-of-the-art algorithms at the 500K and 1000K parameter scales. Recursive networks greatly reduce the number of model parameters by reusing modules, but they also increase the computational complexity of the model: as Table 1 shows, our FSFN_SP requires significantly less computation yet delivers better SR performance than DRCN [16], DRRN [28], and MemNet [30].
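The ×8 self-ensemble described above can be implemented as follows. This is a generic sketch of the standard geometric self-ensemble, not the authors' exact code: the eight transformations are generated as horizontal flips combined with 0°/90°/180°/270° rotations, and each output is inverse-transformed before averaging.

```python
import torch

def self_ensemble(model, lr):
    """x8 geometric self-ensemble for a 1x3xHxW LR tensor."""
    outputs = []
    for hflip in (False, True):
        for n_rot in range(4):  # 0/90/180/270 degree rotations
            t = torch.flip(lr, dims=[-1]) if hflip else lr
            t = torch.rot90(t, n_rot, dims=[-2, -1])
            sr = model(t)
            # apply the inverse transform: un-rotate, then un-flip
            sr = torch.rot90(sr, -n_rot, dims=[-2, -1])
            if hflip:
                sr = torch.flip(sr, dims=[-1])
            outputs.append(sr)
    return torch.stack(outputs).mean(dim=0)
```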

Table 1 Average PSNR/SSIM for scales 2, 3, and 4 on datasets Set5, Set14, BSD100, Urban100, and Manga109
Table 2 PSNR results of the models with different numbers of FRLM and FSFM on Urban100 at 500 epochs (with scaling factor × 4)

4.3.2 Block processing (BP) and feature separation and fusion module (FSFM)

As discussed in Section 2.2, residual learning enables our network to focus on the high-frequency components of the features. EDSR [20] proposes a network architecture similar to that shown in Fig. 12a, whose smallest modular processing unit is the residual block (Resblock); we call this structural framework the directly connected structure. Its disadvantage is that, as the network hierarchy deepens, the shallow primitive features that could be utilized effectively are ignored. To address this, Zhang et al. [39] put forward the residual-in-residual (RIR) structure similar to Fig. 12b, which we call block processing (BP): through short connections, M smallest processing units are grouped into a large module, and N such large modules are grouped into a larger one. This allows the original shallow features to be utilized effectively at different levels, so we regard block processing as the better structural framework. As we can see from Table 3, whether the smallest structural unit is a Resblock or an FSFM, adopting block processing improves the SR performance of the network. EDSR_S denotes the directly connected framework containing eight residual blocks, and EDSR_N2M4 denotes the block-processing framework; EDSR_N2M4 is 0.023 dB higher than EDSR_S in PSNR. Comparing EDSR_S with FSFN_S, and FSFN_N2M4 with EDSR_N2M4, we find that, within the same structural framework, our FSFM improves SR performance more than the Resblock does. Comparing EDSR_N2M4 with FSFN_S, we find that the FSFM improves network performance more than block processing does, and that FSFN_S achieves better performance with fewer parameters. As described in Section 4.3.1, we obtain FSFN_SP through the extended local feature extraction module (ELFEM). FSFN_SP performs best among networks with fewer than 500K parameters. Although the average running time of FSFN_SP on the Urban100 dataset is slightly higher than that of our base network EDSR_S, we choose FSFN_SP as the smaller version of our FSFN for its better SR performance.

Table 3 Investigations of FSFM, Block Processing (BP), and Extended Local Feature Extraction Module (ELFEM)

4.4 Comparison with state-of-the-art methods

In this section, we compare our proposed FSFN with several state-of-the-art methods, including SRCNN [7], FSRCNN [8], VDSR [15], DRCN [16], LapSRN [18], DRRN [28], MemNet [30], FALSR [6], and CARN [2], both quantitatively and visually. The methods are divided into two categories by parameter count: those with about 500K parameters and those with about 1000K. As Table 1 shows, when the model has around 500K parameters, the overall SR performance of our FSFN_SP+ is better than that of all the other networks, especially at the ×2 scaling factor. With around 1000K parameters, our FSFN+ likewise outperforms its peer models. Under the structural similarity (SSIM) criterion, our FSFN_SP and FSFN networks are superior to all other networks even without the self-ensemble strategy.

4.4.1 Visual comparison

Since our proposed FSFN separates and fuses the shallow features within each module, we expect the SR images it generates to have better image detail. Below we show the image recovery results of our network on each dataset. We randomly select an image from the Set14 dataset, named PPT3, shown in Fig. 7: our network renders the words in the original image more clearly. Figure 8 shows image 8023 from the BSDS100 dataset, in which our FSFN restores the detail of the bird's wing at a scaling factor of 4. Figure 9 shows image img_011 from the Urban100 dataset; the hardest detail to recover is the set of vertical lines at the left of the enlarged image patch, and unlike the SR images generated by other methods, those generated by our FSFN clearly show the gaps between the vertical lines. Since the Urban100 dataset contains a large number of high-resolution images, we include the further visual comparisons of Figs. 10 and 11, which show the better SR performance of our network at scaling factors of 4 and 3, respectively. All of these visual comparisons reflect the superior performance of our FSFN framework (Fig. 12).

Fig. 7. Visual comparison for 4× SR on Set14

Fig. 8. Visual comparison for 4× SR on BSDS100

Fig. 9. Visual comparison for 4× SR on Urban100

Fig. 10. Visual comparison for 4× SR on Urban100

Fig. 11. Visual comparison for 3× SR on Urban100

Fig. 12. Overall network framework selection; (a) represents a directly connected structure, while (b) represents a block processing structure

5 Conclusion

In this paper, we proposed a lightweight feature separation and fusion network for single image super-resolution. We adopt the block processing structural framework so that our network can both enrich and make full use of features at different levels. We also propose a feature separation and fusion module as our smallest modular unit, which enhances the network's ability to extract high-frequency features by separating and fusing the shallow features inside the unit. In this way, the SR performance of our network improves significantly, especially in the restoration of texture details. Experiments show that our network offers better SR performance than other lightweight network models.