1 Introduction

As the application scenarios of virtual reality technology continue to expand, so too does the demand for image quality. High-quality images can provide users with a more immersive experience. In this context, events such as the CGI and CASA conferences are dedicated to advancing various fields within computer graphics and virtual reality, making significant contributions to the progress of these technologies. The successful application of image super-resolution techniques will undoubtedly further promote the development of this field. Particularly, the emergence of efficient image super-resolution technology has made it easier to deploy this technology on edge devices, thereby broadening its application.

Fig. 1 Comparison with other SOTA methods for image SR on Set5. The red dots represent the method proposed in this paper

Image super-resolution (SR) is a typical branch of low-level vision methods, reconstructing high-resolution (HR) images from low-resolution (LR) inputs. Traditional single image super-resolution (SISR) methods use interpolation techniques to recover the corresponding HR images from LR ones. While simple and computationally efficient, these methods struggle to restore some of the details and textures in images. Since SRCNN [1] first introduced convolutional neural networks to the field of image super-resolution, deep learning (DL) has achieved remarkable performance and realistic visual effects due to its learnable feature representations. These SR networks [2,3,4,5,6,7,8,9,10,11] have significantly improved the quality of reconstructed images. Their success can be partially attributed to their larger model capacity and intensive computational power. However, this makes them difficult to deploy on resource-constrained devices in real-world applications. Therefore, it is necessary to design lightweight models that improve the efficiency of SISR, achieving a good balance between image quality and inference time.

Many prior works [1, 12,13,14,15,16,17,18,19,20,21,22,23,24,25] have been proposed to develop efficient image super-resolution models. They use different strategies to achieve high efficiency, including parameter sharing strategy [26], cascading network with grouped convolution [27], information or feature distillation mechanisms [21,22,23] and attention mechanisms [2, 3, 22]. Although they have improved efficiency using these strategies, redundancy still exists in convolution operations.

In this paper, to make the network more lightweight, we propose a new lightweight SR network, which consists of several stacked hybrid attention separable blocks (HASBs). This structure is capable of extracting higher-level image features while retaining more edge features and texture details. We use only a few necessary residual connections to prevent the vanishing gradient problem while integrating low-level features. Additionally, we use depth-wise separable convolutions instead of standard convolutions in the convolutional blocks, significantly reducing the computational load and the number of parameters while maintaining strong feature extraction capability. To maximize the model’s capabilities, we propose a warm-start retraining strategy to further learn the image distribution and use the geometric self-ensemble strategy during the inference phase. Specifically, our contributions are as follows:

  • We propose a hybrid attention separable network for efficient image super-resolution, which can extract higher-level image features and include more edge features and texture details without additional residual connections.

  • We propose a warm-start retraining strategy, which helps in learning the distribution of high-resolution images, effectively enhancing network performance.

  • Extensive experiments demonstrate that our proposed method surpasses existing state-of-the-art (SOTA) methods in terms of parameters (Fig. 1) and FLOPs, while maintaining comparable performance in PSNR and SSIM metrics.

2 Related work

2.1 Classical SISR methods

SRCNN [1] is the first work that introduces deep convolutional neural networks (CNNs) to the image SR task. It uses a three-layer convolutional neural network to jointly optimize feature extraction, nonlinear mapping, and image reconstruction in an end-to-end manner, achieving performance superior to traditional SR methods. Subsequent methods adopt more complex convolutional module designs, such as residual blocks [22, 28, 29] and dense blocks [30], to enhance the model’s representational capacity. As networks become larger and deeper, the introduction of various attention mechanisms [2, 31] has become a new trend in image super-resolution research. For example, RCAN [32] employs channel attention, while PAN [33] uses pixel attention. Additionally, self-attention mechanisms have shown significant performance in image reconstruction. SwinIR [2] leverages the Swin Transformer architecture [34], multi-scale feature representation [35], hybrid attention mechanisms, and local–global feature interaction. HAT [31] further expands the window size and uses channel attention to better activate available pixels. PCCFormer [36] uses a parallel attention transformer and an adaptive convolution residual block to improve the feature representation ability of the model. Recently, some emerging attention mechanisms have also achieved great success in imaging [37, 38]. Image super-resolution techniques have also been applied in the medical field, making significant contributions to the diagnosis of brain diseases and to morphometric studies [39].

2.2 Lightweight SISR methods

To meet the requirements of edge devices, it is crucial to develop lightweight and efficient SR models. The SR network SRCNN [1] achieves impressive results but also faces issues such as high computational demands. FSRCNN [12] addresses these issues by removing the interpolation upsampling, introducing transposed convolution at the end of the network, and using smaller but more numerous convolutional kernels, achieving approximately 17 times the acceleration of SRCNN. DRCN [14] employs recursive calls to the feature extraction layers, while DRRN [16] improves upon DRCN by combining recursive and residual networks to achieve better performance with fewer parameters. LapSRN [15] uses transposed convolution for upsampling, leveraging convolutional layers to learn the residuals between high-resolution images and upsampled feature maps, achieving multi-scale reconstruction through progressive upsampling. IDN [18] effectively extracts local long-path and short-path features through an information distillation module, achieving relatively fast inference speed. IMDN [21] constructs a cascaded information multi-distillation block (IMDB) consisting of distillation and selective fusion steps: the distillation module gradually extracts features, while the fusion module determines the importance of candidate features based on an attention mechanism and fuses them accordingly.

Recently, researchers have been optimizing convolution methods to develop lighter and more efficient SR models. For example, ECBSR [40] and RepVGG [41] effectively extract edge and texture information, while FMEN [42] and BSRN [29] further accelerate network inference and reduce the number of network parameters, achieving efficient super-resolution.

3 Methodology

Fig. 2 Overall network architecture of our HASN

3.1 Overall network architecture

For the overall network structure of HASN, we adopt a coarse-to-fine strategy to learn representative features from LR images. As shown in Fig. 2, HASN consists of three main stages: initial feature extraction, multi-stage feature extraction, and high-resolution reconstruction. Here, \(I_{LR}\in {\mathbb {R}} ^{H\times W\times C_{in}}\) represents the original LR input image, where H, W, and \(C_{in}\) are the image height, width, and number of input channels, respectively. A \(3\times 3\) convolutional layer \(H_{IF}(\cdot )\) is used to extract the initial feature. This process can be expressed as:

$$\begin{aligned} \begin{aligned} F_{0} = H_{IF}(I_{LR}), \end{aligned} \end{aligned}$$
(1)

The convolutional layer effectively captures local features of the image, providing feature maps for subsequent deep feature extraction. Next, multi-stage features are extracted from \(F_0\) by a sequence of HASBs. We extract the deep features as:

$$\begin{aligned} \begin{aligned}&F_{i} = H_{{HASB}_{i}}(F_{i-1}), \quad i = 1, 2, \ldots , K,\\&F_{DF} = H_{\textrm{Conv}}(F_{K}), \end{aligned} \end{aligned}$$
(2)

where \(H_{{HASB}_{i}}(\cdot )\) denotes the i-th HASB. A \(3\times 3\) convolutional layer is used after several HASBs to further process and refine the feature representations, enhancing the feature learning capability. Finally, the super-resolved image is reconstructed as:

$$\begin{aligned} \begin{aligned}&I_{SR} = H_{REC}(F_{DF} + F_0), \end{aligned} \end{aligned}$$
(3)

where \(H_{REC}(\cdot )\) is the function of the reconstruction module. It consists of a \(3 \times 3\) convolutional layer and a sub-pixel layer. The \(3 \times 3\) convolutional layer reduces the dimensionality of the high-dimensional feature maps while preserving important information, preparing them for the sub-pixel layer. The entire training process is divided into two stages. The \({\mathcal {L}} _1\) loss function is exploited to optimize the model in the first stage, which can be formulated as follows:

$$\begin{aligned} \begin{aligned}&{\mathcal {L}} _1 = \left\| I_{SR}-I_{HR} \right\| _1, \end{aligned} \end{aligned}$$
(4)

The loss function for the second stage, \({\mathcal {L}} _{s2}\), is defined as follows:

$$\begin{aligned} \begin{aligned}&{\mathcal {L}} _{s2} = \alpha {\mathcal {L}} _1 + \beta {\mathcal {L}} _{D_{KL}}, \\&{\mathcal {L}} _{D_{KL}} = \textstyle \sum _{i} P_{I_{HR}}(i)\log \frac{P_{I_{HR}}(i)}{P_{I_{SR}}(i)}, \end{aligned} \end{aligned}$$
(5)

where \({\mathcal {L}} _{D_{KL}}\) is the KL divergence loss, which is used to measure the difference between the probability distributions of the actual high-resolution image and the predicted super-resolution image. \(P_{I_{HR}}(i)\) represents the probability distribution of the i-th pixel in the high-resolution image, and \(P_{I_{SR}}(i)\) represents the probability distribution of the i-th pixel in the super-resolution image. \(\alpha \) and \(\beta \) are two weighting constants, both of which we set to 1 in this work.
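For concreteness, a minimal PyTorch sketch of the second-stage loss is given below. The paper only states that the KL divergence is taken between the pixel distributions of the HR and SR images; purely for illustration, we assume here that these distributions are obtained by softmax-normalizing the flattened images, and all function names are our own.

```python
import torch
import torch.nn.functional as F

def stage2_loss(sr, hr, alpha=1.0, beta=1.0, eps=1e-8):
    """L_s2 = alpha * L1 + beta * D_KL(P_HR || P_SR), cf. Eq. (5).

    `sr` and `hr` have shape (B, C, H, W). Turning pixel intensities into
    probability distributions via a softmax over the flattened image is an
    assumption of this sketch, not a detail specified in the paper.
    """
    l1 = F.l1_loss(sr, hr)

    p_hr = torch.softmax(hr.flatten(1), dim=1)            # P_{I_HR}
    log_p_sr = torch.log_softmax(sr.flatten(1), dim=1)    # log P_{I_SR}
    # KL(P_HR || P_SR) = sum_i P_HR(i) * (log P_HR(i) - log P_SR(i))
    kl = torch.sum(p_hr * (torch.log(p_hr + eps) - log_p_sr), dim=1).mean()

    return alpha * l1 + beta * kl
```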

3.2 Hybrid attention separable block

Fig. 3 a Architecture of hybrid attention separable block (HASB). b Architecture of channel attention block (CAB). c Architecture of enhanced spatial attention (ESA)

As shown in Fig. 3, our proposed HASB consists of two depth-wise separable convolutions, several fully connected layers, a channel attention block (CAB), and an enhanced spatial attention (ESA) module. First, a \(7 \times 7\) depth-wise separable convolution is applied to the input features \(F_{\textrm{in}}\) to extract local features, and the result is layer-normalized to obtain \(F_{o}\). The normalized features \(F_{o}\) are then fed into three parallel fully connected layers: the output of the first is passed through a ReLU6 activation function, the output of the second is used directly, and the output of the third is processed by the ESA module. The activated first branch is multiplied element-wise with the second branch, and the result is added element-wise to the ESA-processed third branch to obtain the fused features. The fused features are passed through a further fully connected layer, the input feature \(F_{in}\) is added through a residual connection, and another depth-wise separable convolution (DW-Conv) extracts additional features. Finally, the features are processed by the channel attention block to obtain the final output features. The residual connection helps alleviate the vanishing gradient problem and enhances feature learning. The whole structure is described as

$$\begin{aligned} \begin{aligned}&F_{o} = LN(DWConv_{7\times 7}(F_{in})), \\&F_{d_1}, F_{d_2}, F_{d_3} = FC(F_{o}), FC(F_{o}), FC(F_{o}),\\&F_{d} = ReLU6(F_{d_1})\otimes F_{d_2} + ESA(F_{d_3}), \\&F_{d} = DWConv_{7\times 7}(FC(F_{d}) + F_{in}), \\&F_{out} = CAB(F_{d}), \end{aligned} \end{aligned}$$
(6)

where \(DWConv_{7\times 7}\) represents a depth-wise separable convolution with a \(7 \times 7\) kernel, \(LN(\cdot )\) denotes the LayerNorm layer, and FC refers to the fully connected layer.
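The data flow of Eq. (6), together with the overall pipeline of Fig. 2, can be summarized in the following PyTorch sketch. This is an illustrative reading of the block, not the authors' implementation: the ESA and CAB modules are left as injectable placeholders, the depth-wise separable convolution is sketched as a \(7\times 7\) depth-wise convolution (with the \(1\times 1\) “fully connected” layers acting as its point-wise part), and all class and variable names are our own.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dimension of (B, C, H, W) features."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class HASB(nn.Module):
    """Illustrative HASB following Eq. (6); ESA/CAB are placeholders."""
    def __init__(self, dim=52, kernel_size=7, esa=None, cab=None):
        super().__init__()
        pad = kernel_size // 2
        # 7x7 depth-wise convolutions (groups=dim); the 1x1 "fully connected"
        # layers below play the role of the point-wise counterpart.
        self.dwconv1 = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.dwconv2 = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)
        self.norm = LayerNorm2d(dim)
        self.fc1 = nn.Conv2d(dim, dim, 1)
        self.fc2 = nn.Conv2d(dim, dim, 1)
        self.fc3 = nn.Conv2d(dim, dim, 1)
        self.fc4 = nn.Conv2d(dim, dim, 1)
        self.act = nn.ReLU6()
        self.esa = esa if esa is not None else nn.Identity()  # enhanced spatial attention
        self.cab = cab if cab is not None else nn.Identity()  # channel attention block

    def forward(self, x):
        f = self.norm(self.dwconv1(x))
        d = self.act(self.fc1(f)) * self.fc2(f) + self.esa(self.fc3(f))
        d = self.dwconv2(self.fc4(d) + x)   # residual added before the last DW-Conv
        return self.cab(d)

class HASN(nn.Module):
    """Coarse sketch of the overall pipeline of Fig. 2 / Eqs. (1)-(3)."""
    def __init__(self, dim=52, num_blocks=6, scale=4, in_ch=3):
        super().__init__()
        self.head = nn.Conv2d(in_ch, dim, 3, padding=1)               # H_IF
        self.body = nn.Sequential(*[HASB(dim) for _ in range(num_blocks)])
        self.conv_after_body = nn.Conv2d(dim, dim, 3, padding=1)
        self.tail = nn.Sequential(                                    # H_REC
            nn.Conv2d(dim, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        f0 = self.head(x)
        fdf = self.conv_after_body(self.body(f0))
        return self.tail(fdf + f0)
```

With dim set to 52 and six blocks, this skeleton matches the configuration described in Sect. 4.2.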

3.3 Warm-start retraining strategy

We propose a novel warm-start retraining strategy. Different from some previous works that use the \(2\times \) model as a pre-trained network instead of training from scratch, we train HASN for \(4\times \) from scratch in the first stage. In the second stage, we load the model weights from the first stage, which are not fully converged, and further expand the dataset (adding Flickr2K). We further learn the distribution of high-resolution images by minimizing the KL divergence loss and L1 loss, as formulated in Eq. 5. The other training settings remain consistent with the first stage.
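Schematically, the second stage reduces to resuming from an early, not-yet-converged stage-1 checkpoint and continuing training on DF2K with \({\mathcal {L}}_{s2}\). The skeleton below is a sketch under that reading; the checkpoint path, the data loader, and the stage2_loss helper (from the sketch in Sect. 3.1) are assumptions, not the authors' code.

```python
import torch

def warm_start_stage2(model, df2k_loader, ckpt="hasn_stage1_100k.pth"):
    """Schematic second stage: resume from an early stage-1 checkpoint and
    fine-tune on DF2K with L_s2; `ckpt` and the loader are hypothetical."""
    model.load_state_dict(torch.load(ckpt))
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
    for lr_patch, hr_patch in df2k_loader:   # (LR, HR) patch batches from DIV2K+Flickr2K
        sr = model(lr_patch)
        loss = stage2_loss(sr, hr_patch, alpha=1.0, beta=1.0)  # Eq. (5)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```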

4 Experiments

4.1 Datasets and metrics

In this paper, the entire training process is divided into two stages. In the first stage, we use the DIV2K [43] dataset, and in the second stage, we use the DF2K dataset (DIV2K + Flickr2K) [43] to further improve the network performance. DIV2K [43] is a high-quality (2K resolution) image dataset containing 800 training images. Flickr2K is an image dataset with 2K resolution containing 2650 images. Additionally, the low-resolution images of DIV2K and Flickr2K are generated from the ground truth images by the “bicubic” downsampling in MATLAB. For testing, we use five widely used benchmark datasets: Set5 [44], Set14 [45], BSD100 [46], Urban100 [47], and Manga109 [48]. We evaluate all the SR results using the PSNR and SSIM metrics on the Y channel of the YCbCr color space.
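For reference, the Y-channel PSNR can be computed as sketched below, assuming the standard ITU-R BT.601 luma conversion used by MATLAB's rgb2ycbcr; boundary cropping and the SSIM computation are omitted for brevity.

```python
import numpy as np

def rgb_to_y(img):
    """Luma channel (ITU-R BT.601) of an HWC RGB image with values in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr):
    """PSNR between the Y channels of two RGB images of identical shape."""
    y_sr = rgb_to_y(sr.astype(np.float64))
    y_hr = rgb_to_y(hr.astype(np.float64))
    mse = np.mean((y_sr - y_hr) ** 2)
    return float("inf") if mse == 0 else 20.0 * np.log10(255.0 / np.sqrt(mse))
```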

Table 1 Average PSNR/SSIM for scale factor 4 on datasets Set5, Set14, BSD100, Urban100, and Manga109

4.2 Implementation details

The proposed HASN consists of 6 HASBs, and the number of channels is set to 52. The kernel size of all depth-wise convolutions is set to 7. During training, we set the input patch size to 192 \(\times \) 192 and use random rotation and horizontal flipping for data augmentation. The batch size is set to 128, and the total number of iterations is 500k. The initial learning rate is set to \(2\times 10^{-4}\). We adopt a multi-step learning rate strategy, where the learning rate is halved when the number of iterations reaches 250,000, 400,000, 450,000, and 475,000, respectively. The model is trained with the Adam optimizer with \(\beta _{1} = 0.9\) and \(\beta _{2} = 0.99\). In the second stage of training, we choose the model weights from the 100k-th iteration of the first stage as the starting point, and the total number of iterations is set to 1000k. Additionally, we use \({\mathcal {L}}_{s2}\) as the loss function for the second stage. Other training settings remain consistent with the first stage. To maximize the potential performance of the proposed HASN, we use geometric self-ensemble [7] in the experiments, which is applied during inference without additional training. The networks are implemented using the PyTorch framework on an NVIDIA RTX 3090 GPU.
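Geometric self-ensemble [7] averages the model outputs over the eight flip/rotation variants of the input at inference time. A minimal sketch of this procedure (our own illustration, independent of the training code) is shown below; since every transform is inverted before averaging, it adds no parameters and only multiplies inference cost by eight.

```python
import torch

@torch.no_grad()
def geometric_self_ensemble(model, lr):
    """Average SR predictions over the 8 flip/rotation transforms of `lr`.

    `lr` has shape (B, C, H, W); each transform is undone on the output
    before averaging.
    """
    outputs = []
    for rot in range(4):                    # 0, 90, 180, 270 degree rotations
        for flip in (False, True):          # optional horizontal flip
            x = torch.rot90(lr, rot, dims=(-2, -1))
            if flip:
                x = torch.flip(x, dims=(-1,))
            y = model(x)
            if flip:                        # undo the transforms on the output
                y = torch.flip(y, dims=(-1,))
            y = torch.rot90(y, -rot, dims=(-2, -1))
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)
```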

4.3 Comparison with state-of-the-arts

We compare our models with several advanced efficient super-resolution models at a scale factor of 4. The comparison methods include SRCNN [1], FSRCNN [12], VDSR [13], DRCN [14], LapSRN [15], DRRN [16], MemNet [17], IDN [18], SRMDNF [19], CARN [20], IMDN [21], RFDN [22], RLFN [23], DIPNet [24], and SPAN [25]. In terms of model performance, we use PSNR and SSIM as evaluation metrics; in terms of model efficiency, we use the number of parameters and FLOPs to measure model size and computational complexity. The quantitative performance comparison on five benchmark datasets is shown in Table 1. Compared with other state-of-the-art models, HASN achieves better performance on Set5, Set14, and BSD100, and comparable performance on the remaining two datasets. Overall, HASN matches the performance of other networks with fewer parameters and lower computational complexity, achieving a better balance between performance and efficiency.

5 Ablation study

In this section, we conduct a set of ablation experiments to evaluate the performance of each proposed module.

5.1 The choice of multiplication and addition in convolution block

Fig. 4 Design of the convolutional block and convergence curves of different combinations

Many previous efficient image SR methods [22, 25, 49] benefit from residual connections, which carry features from each block up to the upsampling layer. Some methods [21,22,23] also perform feature distillation within each block. However, these approaches often make the network structure redundant, whereas we want to design an efficient and compact network. As observed in [50], element-wise multiplication seems to provide greater gains than addition in narrower networks. This finding is beneficial for our task, as we need to minimize network size while achieving equal or better performance compared to previous methods. Therefore, we design some simple experiments to validate this conclusion. Figure 4a presents the structure of the CB module, and Fig. 4b illustrates the convergence curves of four different configurations. It is evident that when no activation function is used, element-wise multiplication performs significantly better than addition, despite some instability during training. When an activation function is included, both the addition and multiplication configurations exhibit smooth convergence curves, and the PSNR on the test set shows that the network using multiplication slightly outperforms the one using addition. As shown in Table 2, we set up networks with three different embedding dimensions. We find that on Urban100, the PSNR gain of element-wise multiplication over addition decreases as the dimension increases, from 0.08 dB to 0.07 dB, and finally to 0.01 dB. On other test sets, the changes do not seem to follow a consistent pattern. However, across various dimensions, using element-wise multiplication generally yields better performance.
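The two fusion schemes compared in Fig. 4 differ only in how the outputs of the two branches are combined. The toy block below sketches this ablation; it is a simplification of the CB of [50] with illustrative layer choices, not the exact block used in our experiments.

```python
import torch.nn as nn

class CB(nn.Module):
    """Toy convolutional block: two 1x1 branches fused by '*' or '+'."""
    def __init__(self, dim=52, fuse="mul", use_act=True):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.branch1 = nn.Conv2d(dim, dim, 1)
        self.branch2 = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)
        self.act = nn.ReLU6() if use_act else nn.Identity()
        self.fuse = fuse

    def forward(self, x):
        f = self.dwconv(x)
        a, b = self.act(self.branch1(f)), self.branch2(f)
        fused = a * b if self.fuse == "mul" else a + b   # element-wise mult. vs. add.
        return self.proj(fused)
```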

Table 2 Quantitative comparison (average PSNR/SSIM) of element-wise multiplication and addition across different embedding dimensions on benchmark datasets

5.2 Study on HASB number

From Fig. 5, we can observe that with the increase in the number of HASBs, the PSNR shows an upward trend when the HASB number is less than or equal to 10. However, when the HASB number is set to 12, there is a sharp decline in PSNR for Set5. This phenomenon indicates that while increasing the number of HASB modules can enhance the model’s feature extraction capability to some extent, excessively increasing them may lead to overfitting the training data. Due to the complexity of the attention mechanism and fully connected layers within the HASB modules, the model may capture noise and details from the training data, resulting in a reduced generalization ability on the test data. As shown in Table 3, with the increase in the number of HASBs, the model’s parameter count and computational complexity also increase. Setting the HASB number to 6 balances the model size and performance.

Fig. 5 PSNR of different numbers of HASBs on Set5

Table 3 Quantitative comparison (average PSNR/SSIM) of different HASB number on benchmark datasets

5.3 Study on kernel size of depth-wise convolution

Table 4 Quantitative comparison of different kernel sizes. We use the average PSNR/SSIM on the datasets Set5, Set14, BSD100, Urban100, and Manga109 as the metric

To explore the impact of convolution kernel size on network performance, we set the kernel size of all depth-wise convolutions to 3, 5, 7, and 9, respectively. As shown in Table 4, we observe that performance improves with larger kernel sizes across the five benchmark datasets. However, as the kernel size increases, the number of network parameters and FLOPs also increases. From the table, the best results are obtained with kernel sizes of 7 and 9. To balance performance against computational complexity and parameter count, a kernel size of 7 is an appropriate choice.

5.4 Study on residual connection

To explore the role of residual connections in image super-resolution, we use intermediate feature visualization to observe the changes in the network’s intermediate features, as shown in Fig. 6. (d) and (f) show feature map visualizations without and with residual connections, respectively. From left to right, the features progress from lower to higher layers, gradually shifting from capturing detailed information (such as edges and textures) to more abstract information (such as shapes and overall contours). The lower layer feature maps focus more on local features, while the information in the feature maps becomes more abstract and global as the layers deepen.

Comparing (d) and (f), we observe that the feature maps in (d) capture more information at each layer, retaining more edge and texture details. In contrast, the feature maps in (f) lose detail information more quickly and shift to more abstract representations. This suggests that in our method, CBs [50] may be sufficient to learn important features, while using excessive residual connections could introduce noise. The quantitative performance comparison on several benchmark datasets is shown in Table 5. The PSNR on Set5, Set14, B100, Urban100, and Manga109 improved by 0.13dB, 0.07dB, 0.03dB, 0.08dB, and 0.02dB, respectively.

Fig. 6 a Basic network consisting of several CBs and a \(3 \times 3\) convolutional layer. b Based on (a), a residual connection is used after each CB. c Network structure of the convolutional block. d Feature map visualization of the intermediate layers in (a) and (b)

Table 5 Quantitative comparison of networks with and without residual connections
Table 6 Quantitative results of the state-of-the-art models on five benchmark datasets
Fig. 7 Visualization analysis of the impact of CAB and ESA on network feature extraction

Table 7 Quantitative comparison of SPAB and HASB
Table 8 Quantitative comparison of different activation functions
Table 9 Quantitative comparison of models with and without the warm-start retraining strategy

5.5 Effectiveness of HASB architecture

To investigate the impact of different configurations of individual modules in HASB on network performance, we conduct a set of comparative experiments, as shown in Table 6. For example, on Set5, adding CAB to CB increases the PSNR by 0.09dB and the SSIM by 0.0009. Adding ESA to CB increases the PSNR by 0.14dB and the SSIM by 0.0013. When both modules are added, the PSNR and SSIM increase by 0.2dB and 0.0022, respectively. Across the remaining benchmark datasets as well, our network achieves the best performance when CB is combined with both attention modules.

To explore the reason behind this phenomenon, we visualize the output features of the last two layers for these four different network structures, as shown in Fig. 7. We can observe that when these two attention modules are not added, the last two layers of the network extract high-level features that focus on local features with fewer details near the output. In contrast, with the addition of these two attention modules, edges and textures near the network input gradually increase. In low-level vision tasks, low-level features are beneficial for improving network performance.

Additionally, we aim to investigate the characteristics of HASB in advanced feature extraction and low-level feature retention. Therefore, we select SPAB [25], which leverages a parameter-free attention mechanism to achieve feature extraction from shallow to deep layers while maintaining low model complexity and parameter count. We replace HASB with SPAB, keeping all other experimental settings the same. As shown in Table 7, the parameter count of HASB is almost half that of SPAB, but it achieves significant improvements in both PSNR and SSIM across five benchmark datasets.

5.6 Exploration of different activation functions

Most of the previous SR networks adopt ReLU [51] or LeakyReLU [52] as the activation function. ReLU6 [53] is a variant of ReLU that constrains the output to the range [0, 6], i.e., \(\mathrm {ReLU6}(x)=\min (\max (0,x),6)\). It is widely used on mobile and embedded devices because it provides stable performance in low-precision computing environments. The results in Table 8 show that the choice of activation function noticeably affects the performance of the model. Among these activation functions, ReLU and ReLU6 perform comparably. In our experiments, we choose ReLU6 as the activation function.

5.7 Effectiveness of warm-start retraining strategy

To demonstrate the effectiveness of our proposed warm-start retraining strategy, we use HASN trained from scratch on DIV2K as the baseline. As shown in Table 9, without expanding the training set, our model shows a slight performance improvement with the warm-start retraining strategy. When the training set is further expanded, our model achieves PSNR improvements of 0.11dB, 0.07dB, 0.06dB, 0.15dB, and 0.17dB on the five benchmark datasets.

6 Conclusion

In this paper, we propose a hybrid attention separable network (HASN) for efficient image super-resolution. To make the network more efficient, we use only a few necessary residual connections to avoid gradient vanishing. We design a simple CB module to extract high-level features from the input image and use two essential attention modules (ESA and CAB) to enhance edges and textures near the network input. We conduct extensive feature visualizations to comprehensively analyze the effectiveness of the network structure. Additionally, we propose a warm-start retraining strategy to further exploit the network’s performance. Extensive experiments show that the proposed method achieves a better balance between performance and lightweight design compared to other networks.