1 Introduction

Single image super-resolution (SISR) aims to restore high-resolution (HR) images from their corresponding low-resolution (LR) counterparts. This technique holds paramount importance in diverse applications such as remote sensing [1], medical imaging [2], hyperspectral imaging [3], and surveillance [4]. Despite its significance, the inherent ill-posedness of SISR renders accurate image restoration challenging. The advent of convolutional neural networks (CNNs) introduced a transformative approach, facilitating a direct mapping from LR to HR images. Dong et al. [5] pioneered this arena with their SRCNN model, surpassing conventional methods. This led to the proliferation of CNN models, further refining SISR methodologies. Nonetheless, their extensive computational demands hinder efficient deployment on edge devices, as shown in Fig. 1.

To address the computational and storage constraints of edge devices, recent research on SISR is shifting towards the development of lightweight neural network architectures. CNN-based strategies have prevailed, frequently integrating residual or densely connected blocks, often paired with attention mechanisms, to enhance performance. For instance, PAN [6] combined attention mechanisms with residual learning, improving performance. However, CNNs, which are inherently limited to extracting local features, struggle to capture long-distance dependencies. Consequently, researchers turned to the transformer for SISR tasks. SwinIR [7] stood out by incorporating transformer components, establishing a solid baseline for image restoration and emphasizing the potential of transformers. ESRT [8] blended CNNs and transformers into a lightweight architecture, leading to an efficient SISR solution. Yet, both CNNs and transformers often overlook neighborhood and contextual features. Earlier studies posited that neural responses should adapt to the input context: the authors of [9] devised a dynamic convolution mechanism that adjusts weights according to contextual cues, and [10] introduced a context-gated convolution that adaptively alters convolutional kernel weights based on context. Pixels in an image are not isolated; they interact with their surroundings. Notably, the intrinsic quadratic complexity of transformers poses computational challenges. The authors of [11] addressed this by representing self-attention as a linear dot product of kernel feature maps, reducing the complexity to a linear level. [12] utilized a binarization paradigm, approximating linear-complexity attention through binary code dot products.

Inspired by prior work, we propose the joint feature-guided linear transformer and CNN for efficient image super-resolution (JGLTN). Our approach integrates multi-level contextual features to guide the self-attention mechanism and refines the feature similarity calculation, reducing the transformer's complexity from quadratic to linear. JGLTN comprises several cascaded stages, each consisting of a CNN layer and a linear transformer layer. Within the CNN layer, we introduce an inter-scale feature integration module (IFIM). This module utilizes the latent information mining component (LIMC) to extract features, emphasizing valuable information while discarding redundancies. For the linear transformer layer, we put forth the joint feature-guided linear attention (JGLA), anchored by the multi-level contextual feature aggregation (MCFA) block, to effectively integrate adjacent, extended regional, and contextual features. To further optimize self-attention, we revisit the feature similarity computation while maintaining linear complexity. Our primary contributions are summarized as follows:

(1) We introduce a latent information mining component (LIMC) that filters redundant information and flexibly learns local features. Concurrently, we design an inter-scale feature integration module (IFIM) that combines LIMCs to ensure cross-scale feature learning.

(2) We develop a joint feature-guided linear attention (JGLA), utilizing the designed multi-level contextual feature aggregation (MCFA) to synthesize local and extended regional features and adaptively adjust the weights of modulation convolution kernels, enabling the selection of the required contextual information. This allows the joint features to guide the information exchange in self-attention. Additionally, we revisit the feature similarity computation in the attention mechanism, reducing the computational complexity of self-attention to linear.

(3) We construct a joint feature-guided linear transformer and CNN for efficient image super-resolution (JGLTN). Experiments on five benchmark datasets show that our approach achieves an ideal balance between computational cost and performance.

Fig. 1 Model inference time comparison on the Set5 dataset (\(\times\)4)

2 Related work

2.1 Efficient SISR model

Deep learning models such as EDSR [13], ENLCN [14], and DAT [15] have showcased superior performance in SISR. However, their high parameter counts and computational demands limit practical deployment. As a result, recent research has emphasized lightweight SR models. For example, CARN [16] enhanced efficiency through cascaded residual networks, IMDN [17] leveraged distillation and fusion modules for feature aggregation, and RFDN [18] applied channel splitting and fusion residual connection strategies. Similarly, LatticeNet [19] simplified model complexity with its lattice block and backward feature fusion; RepSR [20] refined SR by reintroducing BN and employing structural re-parameterization; FDIWN [21] integrated wide-residual distillation connection and self-calibration fusion to capture multi-scale details. LatticeNet-CL [22] enhanced performance using its novel lattice block (LB) and a contrastive loss, while GASSL [23] innovated in structured pruning with a sparse structure alignment technique. AsConvSR [24] offered a divide-and-conquer tactic in SR by modulating convolution kernels based on input features, culminating in a swift and compact super-resolution network.

The advent of attention mechanisms significantly advanced lightweight SISR tasks. MAFFSRN [25] incorporated multi-attention blocks (MAB) into feature fusion groups (FFG) to bolster feature fusion. Drawing from attention mechanisms, PAN [6] pioneered a pixel attention (PA) strategy, enabling SR with reduced parameters. This method seamlessly merged PA into the primary and reconstruction branches, yielding two innovative building blocks. Similarly, A2N [26] employed an attention-in-attention strategy, which enabled more proactive pixel attention adjustments and enhanced the utilization and comprehension of attention within SISR. PFFN [27] unveiled a progressive attention module to exploit the potential of feature maps by broadening the receptive field of individual layers. RLFN [28] presented a distinctive residual local feature network, refining feature aggregation and re-examining the contrastive loss. FMEN [29] devised an enhanced residual block paired with a sequential attention branch, accelerating network inference. However, many current CNN-based lightweight SISR models treat all features uniformly, neglecting the critical nuances of finer details, which can compromise the network's reconstruction proficiency.

2.2 Vision transformers

Recently, the Vision Transformer (ViT) has demonstrated robust potential in low-level visual tasks such as image denoising [30], deblurring [31], enhancement [32], and dehazing [33]. The exceptional ability of ViT to capture long-range information has also been evident in SISR tasks. Specifically, HAT [34] integrated channel attention with window self-attention strategies to further enhance pixel reconstruction accuracy. DAT [15] alternated between spatial and channel attention within transformer blocks and introduced an SGFN network to integrate intra-module features. While transformer-based approaches have achieved significant results, their deployment in lightweight SISR tasks remains challenging.

Consequently, recent research is dedicated to integrating transformers into lightweight SISR models. SwinIR [7], based on the Swin Transformer [35], designed a reconstruction network comprising multiple residual Swin Transformer blocks and optimized it into a lightweight version, demonstrating impressive reconstruction outcomes. ESRT [8] developed an efficient multi-head transformer structure for SISR, which significantly reduced memory consumption, thereby enhancing feature representation capabilities. To further capture long-distance dependencies, ELAN [36] introduced a novel multi-scale self-attention mechanism that employed different window sizes for attention computation. NGswin [37] integrated the N-Gram concept into SISR, allowing windows to interact with their neighbors via sliding-window self-attention and thereby expanding the regions used when restoring degraded areas, achieving an efficient SR network. These transformer-based lightweight networks have further advanced the performance of SISR tasks.

Introducing the Transformer increased computational demands due to the quadratic complexity of the self-attention mechanism. Addressing this issue, researchers have explored various linear ViT methodologies that capitalize on linear attention to reduce complexity to a linear magnitude. Notable studies [11, 38,39,40,41,42] have adopted kernel-based linear attention strategies to bolster ViT efficacy. These methods eschewed the softmax function, refined computational efficiency by reordering the self-attention computation, and leveraged either kernel functions or comprehensive self-attention matrices. Specifically, [11, 38] managed to attain a linear complexity, preserving a performance on par with the conventional ViT by transforming the softmax-based self-attention’s exponential term using kernel functions and modifying the calculation sequence. Additionally, [41, 42] harnessed low-rank approximation and sparse attention to further optimize linear ViT’s proficiency. In this paper, we reassess feature similarity computation for token similarities, creating a linear self-attention mechanism. This mechanism bypasses the softmax procedure and reaches linear computational complexity, curtailing computational expenses while upholding superior performance.

3 Approach

Fig. 2 Overall framework of the proposed network (JGLTN). It consists of a cascade of CNN layers and linear transformer layers

3.1 Network architecture

We propose a joint feature-guided linear transformer and CNN for efficient SISR (JGLTN). As depicted in Fig. 2, JGLTN comprises multiple CNN layers cascaded with linear transformer layers, and it further integrates two reconstruction modules. The CNN layers primarily focus on extracting beneficial cross-scale features while filtering out redundant ones. On the other hand, the linear transformer layers emphasize amalgamating adjacent, extended regional, and contextual features. This ensures the transformer not only boasts global modeling prowess but also excels in joint feature modeling, all while reducing the computational complexity of the transformer to a linear scale.

Given an input \(I_{LR}\), a convolution \(f_{conv}\) first adjusts its channel dimension to obtain the shallow feature \(I_0\):

$$\begin{aligned} {I}_\textrm{0}={f_\textrm{conv}}(I_\textrm{LR}) \end{aligned}$$
(1)

Subsequently, \(I_0\) serves as the input to the n cascaded CNN and linear transformer layers, a process denoted as:

$$\begin{aligned} I_n=\zeta ^n(\zeta ^{n-1}(...(\zeta ^1(I_0)))) \end{aligned}$$
(2)

where each \(\zeta ^i\) can be represented as the i-th CNN layer \(f_{_{CNN}}^i\) and channel expansion, as well as the i-th linear transformer layer \(f_{_{LT}}^i\) and channel recovery operations, collectively depicting the entire process as:

$$\begin{aligned} \zeta ^i=C_{rec}(f_{_{LT}}^i(C_{_{\exp }}(f_{_{CNN}}^i(I^{i-1})))) \end{aligned}$$
(3)

where \(I^{i-1}\) denotes the output of the (i-1)-th cascaded CNN and linear transformer layer pair, and \(C_{_{\exp }}\) and \(C_{rec}\) respectively represent channel expansion and channel recovery, implemented as convolutional layers that alter the channel dimension of the features. In JGLTN, \(C_{_{\exp }}\) expands the channel number to 144, while \(C_{rec}\) reduces it back to 48. Finally, both \(I_n\) and \(I_{LR}\) are passed through convolutional upsampling to complete the image reconstruction:

$$\begin{aligned} I_{SR}=f_{rec}(I_{n})+f_{rec}(I_{LR}) \end{aligned}$$
(4)

where \(f_{rec}\) represents a reconstruction module comprising a PixelShuffle operation and a \(3\times 3\) convolution, and \(I_{SR}\) denotes the reconstructed high-resolution image.
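To make the data flow of Eqs. (1)-(4) concrete, the following PyTorch-style sketch wires together the cascade, the channel expansion/recovery convolutions, and the two reconstruction branches. The CNNLayer and LinearTransformerLayer placeholders are hypothetical stand-ins for the modules detailed in Sects. 3.2 and 3.3; the channel numbers (48 and 144) follow the text, and everything else is illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the CNN layer (IFIM) and the linear transformer layer;
# only the data flow of Eqs. (1)-(4) is illustrated here.
def CNNLayer(ch):
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

def LinearTransformerLayer(ch):
    return nn.Identity()

class JGLTNSketch(nn.Module):
    def __init__(self, n_stages=8, base_ch=48, exp_ch=144, scale=4):
        super().__init__()
        self.head = nn.Conv2d(3, base_ch, 3, padding=1)                                   # f_conv, Eq. (1)
        self.cnn = nn.ModuleList([CNNLayer(base_ch) for _ in range(n_stages)])
        self.expand = nn.ModuleList([nn.Conv2d(base_ch, exp_ch, 1) for _ in range(n_stages)])   # C_exp
        self.lt = nn.ModuleList([LinearTransformerLayer(exp_ch) for _ in range(n_stages)])
        self.recover = nn.ModuleList([nn.Conv2d(exp_ch, base_ch, 1) for _ in range(n_stages)])  # C_rec
        # two reconstruction branches f_rec, each a 3x3 conv followed by PixelShuffle (Eq. 4)
        self.rec_feat = nn.Sequential(nn.Conv2d(base_ch, 3 * scale ** 2, 3, padding=1),
                                      nn.PixelShuffle(scale))
        self.rec_img = nn.Sequential(nn.Conv2d(3, 3 * scale ** 2, 3, padding=1),
                                     nn.PixelShuffle(scale))

    def forward(self, lr):
        x = self.head(lr)                                                                 # I_0
        for cnn, exp, lt, rec in zip(self.cnn, self.expand, self.lt, self.recover):
            x = rec(lt(exp(cnn(x))))                                                      # one stage zeta^i, Eq. (3)
        return self.rec_feat(x) + self.rec_img(lr)                                        # I_SR, Eq. (4)

# usage: sr = JGLTNSketch()(torch.randn(1, 3, 48, 48))  # -> 1 x 3 x 192 x 192
```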

For the proposed JGLTN network, its loss function L can be expressed as:

$$\begin{aligned} L(\Theta )&=\arg \min _{\Theta }\frac{1}{m}\sum _{i=1}^{m}\left\| f_{JGLTN}(I_{LR}^{i})-I_{HR}^{i}\right\| _{1} \\ &=\arg \min _{\Theta }\frac{1}{m}\sum _{i=1}^{m}\left\| I_{SR}^{i}-I_{HR}^{i}\right\| _{1} \end{aligned}$$
(5)

where \(f_{JGLTN}(\cdot )\) represents the network model we proposed, \(I_{LR}^{i}\) is the input low-resolution image, \(I_{HR}^{i}\) is its corresponding high-resolution ground truth image, \(\Theta\) denotes the learnable parameters within the network, and m signifies the number of image pairs in the dataset.
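As a minimal illustration of Eq. (5), the snippet below performs one optimization step with the L1 objective; `model`, `optimizer`, and the tensor batches are placeholders assumed to come from the surrounding training pipeline, which the paper does not specify in code.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, lr_batch, hr_batch):
    """One L1-loss optimization step following Eq. (5); all arguments are
    assumed to be provided by an external (unspecified) training loop."""
    sr_batch = model(lr_batch)               # I_SR = f_JGLTN(I_LR)
    loss = F.l1_loss(sr_batch, hr_batch)     # mean absolute error over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```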

3.2 CNN layer

Within this layer, we present the inter-scale feature integration module (IFIM), designed to capture intricate feature details. As depicted in Fig. 3, the primary components of the inter-scale feature integration module are two latent information mining components (LIMCs) and a cross-scale feature learning mechanism.

Fig. 3 Structure of the inter-scale feature integration module (IFIM). Its latent information mining component (LIMC) is used to filter unnecessary features and extract valuable information

3.2.1 Inter-scale feature integration module

CNN-based models are notably adept at feature extraction. However, they often incorporate superfluous features. Furthermore, the receptive field of CNNs is inherently restricted, but utilizing features from multiple scales proves crucial for understanding intricate details. To tackle this, we introduce the inter-scale feature integration module (IFIM). This module amalgamates a latent information mining component (LIMC) with a cross-scale learning mechanism and stride convolution. Such a synthesis guarantees the model’s proficiency in pinpointing and assimilating valuable features across varied scales.

Specifically, inter-scale feature integration module (IFIM) bifurcates the input into two branches. One branch retains the original feature dimensions, making selections from the native features. The other branch expands its receptive field to learn features within a larger spatial context. We articulate this entire process as

$$I_{IFIM}= f_{LIMC}(f_{LIMC}(f_{_{RB}}(I_0)) \times (f_{_{sig}}(f_{_{stride}}(f_{_{\textrm{CL}}}(I_0))+I_0)))+I_0$$
(6)

where \(f_{_{RB}}\) represents a residual block. For the shallow feature input \(I_0\), features are first extracted through a foundational residual block, followed by feature selection using our proposed latent information mining component (LIMC). \(f_{_{\textrm{CL}}}\) corresponds to a learnable upsampling [43]. For \(f_{_{\textrm{CL}}}\), a mask is predicted through a convolution layer, representing the relationship between each original-resolution pixel and its neighboring pixels; the original pixels are weighted sums of their neighborhood pixels, with weights provided by this mask. Subsequently, an unfold operation is used to expand the tensor. Essentially, this operation extracts a \(3\times 3\) neighborhood around each point, stretches these neighborhoods into one-dimensional vectors, and reshapes them so that each pixel and its neighborhood information can be weighted by the corresponding entries in the mask. Finally, a summing operation aggregates the weighted neighborhood information of each pixel, resulting in a higher-resolution output. Thus, it guides the transformation of features in the original feature map within a larger receptive field. This is followed by a stride convolution \(f_{_{stride}}\) with a stride of 2, a residual connection, and a sigmoid function \(f_{_{sig}}\). \(f_{LIMC}\) is the latent information mining component (LIMC), further detailed in the ensuing segment. The cross-scale learning mechanism integrates information across different scales, enabling the network to better understand the structure within images.
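The data flow of Eq. (6) can be sketched as follows. This is a simplified illustration only: the LIMC blocks and the learnable upsampling \(f_{_{\textrm{CL}}}\) are replaced by plain convolution and nearest-neighbor upsampling stand-ins, so the sketch captures the two-branch cross-scale gating structure rather than the exact components.

```python
import torch
import torch.nn as nn

class IFIMSketch(nn.Module):
    """Simplified sketch of Eq. (6). f_LIMC and the learnable upsampling f_CL
    are stand-ins (plain conv / nearest upsampling); only the two-branch
    cross-scale gating structure is illustrated."""
    def __init__(self, ch):
        super().__init__()
        self.rb = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                nn.Conv2d(ch, ch, 3, padding=1))               # body of f_RB
        self.limc1 = nn.Conv2d(ch, ch, 3, padding=1)                           # stand-in for f_LIMC
        self.limc2 = nn.Conv2d(ch, ch, 3, padding=1)                           # stand-in for f_LIMC
        self.cl = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                nn.Upsample(scale_factor=2, mode='nearest'))   # stand-in for f_CL
        self.stride = nn.Conv2d(ch, ch, 3, stride=2, padding=1)                # f_stride

    def forward(self, x):
        native = self.limc1(self.rb(x) + x)                    # f_LIMC(f_RB(I_0))
        gate = torch.sigmoid(self.stride(self.cl(x)) + x)      # f_sig(f_stride(f_CL(I_0)) + I_0)
        return self.limc2(native * gate) + x                   # Eq. (6)
```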

Fig. 4 Structure of the linear transformer layer, including the multi-level contextual feature aggregation (MCFA), joint feature-guided linear attention (JGLA), and contextual information perceptron

3.2.2 Latent information mining component

Building on our prior research in the SISR task, we observe a tendency for reconstructed images to exhibit over-smoothing. This problem emerges mainly because the model indiscriminately processes all input image features, neglecting that certain features are pivotal for recovering image texture details. To address this, we introduce the latent information mining component (LIMC), which emphasizes texture details, as illustrated in Fig. 3. The underlying principle of the latent information mining component (LIMC) is to filter out redundant features by generating masks, thereby better preserving vital image details. Initially, input features are processed through convolutions and activation functions, yielding mixed features. Following this, a \(1\times 1\) convolution is utilized to modify the channel count of these mixed features to three. Weights are assigned to each channel via a softmax operation, efficiently discarding superfluous features.

$$\begin{aligned} I_{CR}^{'}&=f_{CR}(I_{CR}) \end{aligned}$$
(7)
$$\begin{aligned} I_{LIMC}&=f_{CSAM}(f_{CR}(f_{SM}(f_{1\times 1}(I_{CR}^{'}))\times I_{CR}^{'}))+I_{CR} \end{aligned}$$
(8)

where \(f_{CR}\) denotes a combination of convolution and activation functions that produces the output feature \(I_{CR}^{'}\). Subsequently, a \(1\times 1\) convolution, represented by \(f_{1\times 1}\), generates a mask. After a softmax operation \(f_{SM}\), this mask filters out superfluous features. The resulting weight mask is then multiplied with \(I_{CR}^{'}\) to retain the parts of the original feature that contribute significantly to texture detail. Finally, an additional convolution layer and a channel-spatial attention mechanism \(f_{CSAM}\) based on 3D convolution are introduced to refine the features further. CSAM adopts the channel-spatial attention from HAN [44], incorporating responses from all dimensions of the feature maps. A 3D convolutional layer captures the joint channel and spatial features within the feature maps, generating an attention map. This is achieved by applying a 3D convolutional kernel to the data cube formed by multiple adjacent channels of the input features. The 3D convolutional kernels, sized \(3\times 3\times 3\) with a stride of 1, convolve with three consecutive groups of channels, each group interacting with a set of 3D convolutional kernels, to produce three sets of channel-spatial attention maps. Through this process, CSAM extracts robust representations that describe inter-channel and intra-channel information across consecutive channels. A residual connection complements this to ensure the effective operation of the latent information mining component (LIMC). As a result, within the inter-scale feature integration module (IFIM), the latent information mining component (LIMC) retains the most valuable features via the weight mask.
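A sketch of Eqs. (7)-(8) is given below. Two points are explicit assumptions on our part: the text generates a 3-channel softmax mask but does not fully specify how it is applied to a C-channel feature, so the sketch splits the feature into three channel groups and weights each group by one mask channel; and the CSAM branch is reduced to a single 3D convolution stand-in rather than the grouped \(3\times 3\times 3\) design described above.

```python
import torch
import torch.nn as nn

class LIMCSketch(nn.Module):
    """Sketch of Eqs. (7)-(8) under the assumptions stated in the text above:
    group-wise application of the 3-channel mask, and a single 3D conv as a
    stand-in for the channel-spatial attention (CSAM)."""
    def __init__(self, ch=48):
        super().__init__()
        assert ch % 3 == 0
        self.cr1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))  # f_CR
        self.mask = nn.Conv2d(ch, 3, 1)                                                   # f_1x1
        self.cr2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))  # f_CR
        self.csam = nn.Conv3d(1, 1, 3, padding=1)             # stand-in for 3D channel-spatial attention

    def forward(self, x):
        f = self.cr1(x)                                       # I'_CR, Eq. (7)
        m = torch.softmax(self.mask(f), dim=1)                # f_SM(f_1x1(I'_CR)): 3-channel weight mask
        b, c, h, w = f.shape
        f_sel = (f.view(b, 3, c // 3, h, w) * m.unsqueeze(2)).view(b, c, h, w)  # assumed group-wise filtering
        f_sel = self.cr2(f_sel)
        att = torch.sigmoid(self.csam(f_sel.unsqueeze(1))).squeeze(1)           # f_CSAM
        return f_sel * att + x                                # Eq. (8), residual connection
```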

3.3 Linear transformer layer

The powerful global feature learning of transformers presents a novel approach for the SISR task. However, transformers currently employed in SISR often overlook adjacent, extended regional, and contextual features, and they incur high computational complexity. To address these issues, we propose a joint feature-guided linear attention (JGLA). Within JGLA, we employ a multi-level contextual feature aggregation (MCFA) block to obtain joint features. This block carefully considers adjacent, extended regional, and contextual features, optimizing the surrounding and contextual information of a given pixel during feature reconstruction. Furthermore, we refine the self-attention mechanism by streamlining the vector similarity computation, redesigning the self-attention calculation and effectively reducing its complexity to a linear scale.

3.3.1 Multi-level contextual feature aggregation

Transformers are powerful at global feature modeling but overlook the significance of adjacent, extended regional, and contextual features. Typically, in traditional transformers, the self-attention mechanism derives its attention values from three linear layers following linear embedding. Our objective is for our transformer to handle adjacent, extended regional, and contextual features. As a result, we have modified the linear embedding layer within the transformer, replacing it with our proposed multi-level contextual feature aggregation (MCFA).

Multi-level contextual feature aggregation (MCFA) is adept at harnessing adjacent, extended regional, and contextual features. Specifically, the MCFA comprises a \(3\times 3\) convolution, a dilated convolution, and a context modulation convolution (CMC), as illustrated in Fig. 4. For the input feature \(I_{Input}\), the entire process can be expressed as:

$$\begin{aligned} I_{MCFA}=f_{CMC}(f_{relu}(Concat(f_{adj}(I_{Input}),f_{ext}(I_{Input})))) \end{aligned}$$
(9)

where \(f_{adj}\) represents a standard \(3\times 3\) convolution, which learns local features from adjacent feature vectors, and \(f_{ext}\) is a dilated convolution that captures a broader receptive field to better grasp surrounding features. After concatenating the outputs of \(f_{adj}\) and \(f_{ext}\), a ReLU activation is applied.

Subsequently, global contextual information is adaptively learned through \(f_{CMC}\). In the CMC, for input features of size \(c\times h\times w\), a max pooling operation first reduces the spatial dimensions to \(k\times k\). A shared-parameter linear layer projects the spatial location information into a vector of size \((k\times k)/2\), from which new channel weights are generated. To alleviate the time-consuming kernel modulation caused by a large number of channels, the concept of grouped convolution is applied to the linear layer, producing an output dimension o and facilitating channel interaction. Subsequently, another linear layer produces tensor outputs in two directions, \(o\times 1\times k\times k\) and \(1\times c\times k\times k\). These two tensors are added element-wise to form our modulated convolution kernel, resized to \(o\times c\times k\times k\), which simulates the kernel of an actual convolution operation. The simulated modulated convolution kernel is then multiplied by an adaptive multiplier W and reshaped. W is a trainable parameter tensor of matching size that is bound to the module, so its weights are learned and updated automatically during training to optimize the model. For the input \(X_{input}\), the unfold function extracts sliding features of size \(k\times k\), linking the context between different pixel features and resulting in a feature map \(X_{input}^{'}\in \mathbb {R}^{k^2c\times hw}\). Ultimately, the fully learned modulated convolution kernel W' obtains context guidance from the input features, adaptively capturing the contextual information required for key pixels.
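The sketch below illustrates the MCFA of Eq. (9) together with a condensed CMC. The grouped-linear kernel modulation is simplified and the layer sizes are our assumptions; only the overall recipe is kept: pool to \(k\times k\), generate a per-sample kernel of size \(o\times c\times k\times k\) scaled by the learnable multiplier W, and apply it through unfold and a matrix product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMCSketch(nn.Module):
    """Simplified context modulation convolution: a per-sample kernel of size
    o x c x k x k is generated from pooled context, scaled by a learnable
    multiplier W, and applied via unfold + matmul. Layer sizes are assumptions."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k, self.out_ch = k, out_ch
        self.pool = nn.AdaptiveMaxPool2d(k)                      # c x k x k context summary
        self.proj = nn.Linear(k * k, (k * k) // 2)               # shared spatial projection
        self.back = nn.Linear((k * k) // 2, k * k)               # back to k x k per channel
        self.head_o = nn.Linear(in_ch, out_ch)                   # output-direction branch
        self.W = nn.Parameter(torch.ones(out_ch, in_ch, k, k))   # adaptive multiplier W

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = self.back(F.relu(self.proj(self.pool(x).flatten(2))))        # b x c x (k*k)
        k_in = ctx.view(b, 1, c, self.k, self.k)                            # 1 x c x k x k direction
        k_out = self.head_o(ctx.mean(-1)).view(b, self.out_ch, 1, 1, 1)     # o-direction branch
        kernel = (k_in + k_out) * self.W                                    # b x o x c x k x k
        patches = F.unfold(x, self.k, padding=self.k // 2)                  # b x (c*k*k) x (h*w)
        out = torch.bmm(kernel.reshape(b, self.out_ch, -1), patches)        # context-modulated conv
        return out.view(b, self.out_ch, h, w)

class MCFASketch(nn.Module):
    """Eq. (9): adjacent (3x3 conv) and extended regional (dilated conv)
    features are concatenated, activated, and fed to the CMC."""
    def __init__(self, ch):
        super().__init__()
        self.adj = nn.Conv2d(ch, ch, 3, padding=1)                          # f_adj
        self.ext = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)              # f_ext
        self.cmc = CMCSketch(2 * ch, ch)                                    # f_CMC

    def forward(self, x):
        return self.cmc(F.relu(torch.cat([self.adj(x), self.ext(x)], dim=1)))
```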

3.3.2 Joint-feature guided linear attention

To reduce the computational complexity of the self-attention mechanism, we introduce the linear transformer. The essence of linear attention lies in decomposing the similarity measure function into distinct kernel embeddings, denoted as \(S(Q,K)\approx \Phi (Q)\Phi (K)^T\). Consequently, leveraging the properties of matrix multiplication, we can rearrange the computation order to \(\Phi (Q)(\Phi (K)^TV)\). The complexity of the attention mechanism then no longer scales quadratically with the token length but depends on the feature dimension, which is much smaller than the token length. We redesign the similarity computation between vector pairs, formulating a linear attention suited to the SISR task. This offers a substitute for conventional softmax-based attention while enhancing performance. Specifically, the dot product of two vectors, \(v_1\) and \(v_2\), is defined as:

$$\begin{aligned} v_1\cdot v_2=\Vert v_1\Vert \cdot \Vert v_2\Vert \cos \theta \end{aligned}$$
(10)

where \(\theta\) represents the angle between the vector pair. Consequently, this angle can be expressed as:

$$\begin{aligned} \theta (v_1,v_2)=\arccos \left( \frac{\langle v_1,v_2\rangle }{\Vert v_1\Vert \cdot \Vert v_2\Vert }\right) \end{aligned}$$
(11)

where \(\langle \cdot ,\cdot \rangle\) denotes the inner product and \(\left\| \cdot \right\|\) signifies the Euclidean norm. Thus, the range of \(\theta\) is \([0,\pi ]\). We incorporate this angular calculation into the similarity computation between the Q and K values in the attention mechanism, articulated as:

$$\begin{aligned} S(Q,K)=1-\frac{1}{\pi }\cdot \theta (Q,K) \end{aligned}$$
(12)

This confines the output range of S(Q, K) to [0, 1]. As the distance between Q and K decreases, the angle \(\theta\) also decreases toward 0, making S(Q, K) approach 1. Conversely, as the distance between Q and K grows, \(\theta\) increases toward \(\pi\), and S(Q, K) approaches 0, signifying a diminished similarity between Q and K. For example, identical unit vectors give \(\theta =0\) and S(Q, K) = 1, orthogonal vectors give S(Q, K) = 1/2, and opposite vectors give S(Q, K) = 0.

Substituting Eq. 11 into Eq. 12, we obtain:

$$\begin{aligned} S(Q,K)=1-\frac{1}{\pi }\arccos \!\left( \frac{\langle Q,K\rangle }{\Vert Q\Vert \cdot \Vert K\Vert }\right) \end{aligned}$$
(13)

To simplify Eq. 13, we employ trigonometric identities and an infinite series expansion:

$$\begin{aligned}\textrm{S}(Q,K)&=1-\frac{1}{\pi }\biggl (\frac{\pi }{2}-\arcsin (Q\cdot K^{T})\biggr ) \\&=\frac{1}{2}+\frac{1}{\pi }\cdot (Q\cdot K^T) \\&+\frac{1}{\pi }\cdot \sum _{t=1}^{\infty }\frac{(2t)!}{2^{2t}(t!)^2(2t+1)}(Q\cdot K^T)^{2t+1} \end{aligned}$$
(14)

In this configuration, \((Q\cdot K^{T})\) denotes the normalized product \(\left( \frac{\langle {Q},K\rangle }{\Vert Q\Vert \cdot \Vert K\Vert }\right)\). As observed from the above equation, the first terms can serve as a similarity measure in linear attention with a complexity of O(n), whereas the remaining higher-order terms introduce greater complexity. To tailor our linear attention mechanism to the SISR task, we adopt a linear expansion approach. Given that the normalized products between Q and K are close to zero, we retain only the linear term and discard the higher-order terms. This preserves linear complexity while sidestepping the additional cost induced by the higher-order elements. Moreover, existing studies suggest that softmax-based attention generates a full-rank attention map, reflecting the feature diversity of the model, whereas linear attention cannot yield a full-rank attention map [45]. As a countermeasure, we introduce a depth-wise convolution (DWConv), restoring the rank of the attention matrix and ensuring feature diversity.
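As a quick sanity check of dropping the higher-order terms, the snippet below compares the exact similarity of Eq. (13) with the retained linear term for small normalized dot products, the regime assumed above; the gap stays below roughly \(4\times 10^{-4}\) for \(|x|\le 0.2\).

```python
import numpy as np

# Compare the exact similarity 1 - arccos(x)/pi (Eq. 13) with the linear term
# 1/2 + x/pi retained in Eq. (15), for small normalized dot products x.
x = np.linspace(-0.2, 0.2, 401)
exact = 1.0 - np.arccos(x) / np.pi
linear = 0.5 + x / np.pi
print(f"max |exact - linear| = {np.max(np.abs(exact - linear)):.2e}")  # ~ 4e-04
```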

Therefore, our attention mechanism can be described as:

$$\begin{aligned} \begin{aligned}L&=S(Q,K)\cdot V\\ {}&\approx \frac{1}{2}\cdot V+\frac{1}{\pi }\cdot Q\cdot (K^T\cdot V)+f_{DW}\cdot V\end{aligned} \end{aligned}$$
(15)

where \(\frac{1}{2}\cdot V+\frac{1}{\pi }\cdot Q\cdot (K^T\cdot V)\) is the linear term from Eq. 14 and \(f_{DW}\) represents a depthwise separable convolution that restores a full-rank attention matrix, thereby enriching feature diversity. Consequently, the final transformer layer consists of two joint feature-guided linear attention (JGLA) blocks and a contextual information perceptron, as shown in Fig. 4. For the contextual information perceptron, we retain the context modulation convolution (CMC), substituting it for the multi-layer perceptron of the traditional transformer.
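A minimal sketch of the attention computation in Eq. (15) follows. It assumes Q, K, and V are already token matrices produced by MCFA, L2-normalizes Q and K so that \(Q\cdot K^T\) is the cosine term, and adds a depth-wise convolution on V as the rank-restoring branch; the projection layers and exact tensor shapes are our assumptions rather than the authors' implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttentionSketch(nn.Module):
    """Sketch of Eq. (15): 1/2 * V + 1/pi * Q (K^T V) + DWConv(V).
    Q, K, V are b x n x d token matrices (n = h*w); input projections are omitted."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # f_DW, depth-wise conv on V

    def forward(self, q, k, v, h, w):
        q = F.normalize(q, dim=-1)                 # L2-normalize so q . k is the cosine term
        k = F.normalize(k, dim=-1)
        b, n, d = v.shape
        kv = k.transpose(1, 2) @ v                 # K^T V: d x d, computed once -> O(n d^2)
        out = 0.5 * v + (1.0 / math.pi) * (q @ kv) # linear term of Eq. (15)
        v_img = v.transpose(1, 2).reshape(b, d, h, w)
        return out + self.dw(v_img).flatten(2).transpose(1, 2)    # add rank-restoring branch
```

For n tokens of dimension d, the cost is O(n d^2) rather than O(n^2 d), which is what makes the attention linear in the sequence length.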

Table 1 Performance of our method compared with state-of-the-art SR methods with BI degradation for \(\times 2\), \(\times 3\), and \(\times 4\) image super-resolution on benchmark datasets
Table 2 Performance of our method compared with state-of-the-art transformer-based methods on benchmark datasets

4 Experiments

4.1 Datasets and metric

Consistent with previous works, we utilize the DIV2K [53] dataset for our training, which consists of 800 training images. We further validate the effectiveness of our model using five public benchmark datasets for testing, including Set5 [54], Set14 [55], B100 [56], Urban100 [57], and Manga109 [58]. Additionally, the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) are utilized as metrics to evaluate the image restoration quality on the Y channel.
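For reference, PSNR on the Y channel is typically computed as in the sketch below; the BT.601 conversion and the border crop by the scale factor follow common SISR evaluation practice and are our assumptions about the exact protocol rather than details given in the text.

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luma for float RGB images in [0, 1]; returns Y in the [16, 235] range."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b

def psnr_y(sr, hr, scale=4):
    """PSNR (dB) on the Y channel, cropping `scale` border pixels as is common practice."""
    y_sr = rgb_to_y(sr)[scale:-scale, scale:-scale]
    y_hr = rgb_to_y(hr)[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```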

4.2 Implementation details

For training, we apply \(\times 2\), \(\times 3\), and \(\times 4\) upscaling factors. The batch size is set to 16. Additionally, for data augmentation, we apply random rotations of 90\(^{\circ }\), 180\(^{\circ }\), and 270\(^{\circ }\) and randomly crop a patch of size 48\(\times\)48 as input. We set the initial learning rate to \(5e^{-4}\) and employ the Adam optimizer with parameters \(\beta _1\) = 0.9, \(\beta _2\) = 0.999, and \(\epsilon = 1e^{-8}\). The model uses 48 input channels, with channel adjustments after each CNN layer and linear transformer layer. Furthermore, the training architecture consists of eight CNN layers and eight linear transformer layers. All experiments are conducted on an Nvidia RTX A6000 GPU.
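The training recipe above translates roughly into the following snippet; the `model` object is a stand-in for JGLTN, the patch pairs are assumed to come from a DIV2K data loader, and flips are omitted since only rotations are mentioned in the text.

```python
import random
import torch

def augment(lr_patch, hr_patch):
    """Apply the same random 0/90/180/270-degree rotation to an LR/HR patch pair."""
    k = random.randint(0, 3)
    return (torch.rot90(lr_patch, k, dims=(-2, -1)),
            torch.rot90(hr_patch, k, dims=(-2, -1)))

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the JGLTN network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), eps=1e-8)
```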

4.3 Comparison with advanced lightweight SISR models

In this section, we compare our approach with state-of-the-art lightweight SISR methods, including [16,17,18,19, 23, 25, 26, 29, 46,47,48,49,50,51]. We analyze from two perspectives: quantitative and qualitative.

4.3.1 Quantitative evaluations

We validate the quantitative performance of our model by comparing it with state-of-the-art models on five benchmark datasets at \(\times 2\), \(\times 3\), and \(\times 4\) scales, as presented in Table 1. We adhere to standardized methods to calculate the number of parameters uniformly. For the calculation of Multi-Adds, we are consistent with other methods and set the HR image size to 1280\(\times\)720. The results highlight the superior performance of our method over advanced models. Notably, JGLTN consistently ranks among the top across all datasets. At the \(\times 4\) scale, our approach exceeds the LatticeNet-CL model by margins of 0.25dB on Set5, 0.14dB on Urban100, and 0.19dB on Manga109. Such enhanced performance is attributed to the seamless integration of CNN and transformer, efficiently capturing adjacent, extended regional, and contextual information. Additionally, adapting linear attention for SISR tasks simplifies the model's design while ensuring excellent performance.

Furthermore, we conducted a comparative analysis of our model against advanced transformer-based models, as illustrated in Table 2. To compare our method with these approaches more clearly, we use the average PSNR/SSIM for evaluation, which allows a rapid assessment of an algorithm's effectiveness from its average performance. Compared to lightweight models such as SwinIR, ESRT, and NGswin, our approach demonstrates equivalent efficacy while maintaining a similar scale of parameters and computational resources. It is worth noting that, compared to SwinIR, JGLTN maintains a comparable number of parameters with lower Multi-Adds. Moreover, SwinIR uses a pretrained model for initialization and sets the patch size to 64\(\times\)64 during training; extensive experiments have shown that larger patch sizes yield better results, whereas JGLTN uses a patch size of 48\(\times\)48. SwinIR also employs an additional dataset (Flickr2K [59]) for training, which is crucial for further enhancing model performance. To ensure a fair comparison with methods such as ESRT, we did not use this external dataset in our work. The results of JGLTN on some datasets even surpass those of SwinIR, with higher average PSNR, while maintaining comparable parameters and fewer Multi-Adds.

Fig. 5 Qualitative comparison of our JGLTN with recent state-of-the-art lightweight image SR methods for \(\times\)4 SR on the Set14 and B100 datasets. The performance of each image patch is shown in the following

Fig. 6 Qualitative comparison of our JGLTN with recent state-of-the-art lightweight image SR methods for \(\times\)4 SR on the Urban100 dataset. The performance of each image patch is shown in the following

4.3.2 Qualitative evaluations

To more thoroughly examine the efficacy of our model, we performed a qualitative comparison with current advanced models. As depicted in Figs. 5 and 6, prior methods manifest issues such as boundary blurring, excessive smoothing, line distortions, and, in some instances, alterations to the original image structure, significantly diminishing visual appeal. In stark contrast, our proposed model preserves the original image structure, maintains sharp boundaries, and retains intricate texture details. Specifically, for \('ppt3'\) in Fig. 5, our method provides more pronounced edge information than representative CNN and transformer methods. Furthermore, for \('img{\_}78'\) of the Urban100 dataset, while numerous advanced techniques compromise the original image structure, our approach successfully maintains the essential original information, yielding a visually superior result. Therefore, JGLTN not only upholds a high PSNR metric but also ensures enhanced visual outcomes.

Table 3 The effect of the latent information mining component (LIMC) in terms of PSNR score on the benchmark datasets for \(\times 4\) SR
Table 4 The effect of the inter-scale feature integration module (IFIM) in terms of PSNR score on the benchmark datasets for \(\times 4\) SR
Table 5 The effect of the multi-level contextual feature aggregation (MCFA) and joint feature-guided linear attention (JGLA) in terms of PSNR score on the benchmark datasets for \(\times 4\) SR
Table 6 The effect of integrating the CNN layer (CNN) with the linear transformer layer (LT) in terms of PSNR score on the benchmark datasets for \(\times 4\) SR

4.4 Ablation study

4.4.1 The effectiveness of CNN layer

Effectiveness of the latent information mining component (LIMC): We validate the effectiveness of the LIMC within the JGLTN model. Experiments are conducted by excluding the LIMC module from the inter-scale feature integration module (IFIM). Notably, to expedite the experiments, the model omits the linear transformer layers. Table 3 displays the comparative results between configurations with and without LIMC. The results reveal that although the parameter count decreases with the removal of LIMC, model performance drops by 1.03dB on the Urban100 dataset and 2.19dB on the Manga109 dataset. This underscores the role of the LIMC in retaining valuable information and filtering out unnecessary features.

Fig. 7 Visual comparison of features with and without the inter-scale feature integration module (IFIM) and joint feature-guided linear attention (JGLA)

Effectiveness of the inter-scale feature integration module (IFIM): We evaluate the impact of cross-scale learning (CL) to ascertain the efficacy of IFIM. Integrating CL results in an average increase of 0.04dB in PSNR across five datasets, as Table 4 illustrates. This underscores the crucial role of CL in cross-scale feature learning within IFIM. For a more comprehensive validation of IFIM, we conduct a comparative analysis with state-of-the-art CNN modules, replacing IFIM with advanced modules such as RCAB, HPB, and LB. To expedite the experiment, we exclude the linear transformer layers. Despite a marginal parameter increase, IFIM demonstrates a nearly 0.1dB improvement in PSNR over these advanced CNN modules, as indicated in Table 4. This performance attests to the capability of IFIM to learn valuable features across various scales.

Furthermore, to validate the efficacy of inter-scale feature integration module (IFIM) in feature extraction and pinpoint its focus areas on the image, we executed a meticulous visual analysis. Figure 7a contrasts the output feature maps with and without the incorporation of IFIM. Without IFIM, the network seemingly emphasizes the flat regions interspersed among textures. Notably, texture details are paramount for image reconstruction. With the integration of IFIM, the network shifts its attention predominantly towards the intricate textures of the butterfly. This observation strongly attests to the enhanced feature extraction capabilities of IFIM.

4.4.2 The effectiveness of linear transformer layer

Effectiveness of multi-level contextual feature aggregation (MCFA): MCFA plays a crucial role in guiding joint features for the transformer. We validate its effectiveness by replacing MCFA with the linear embedding layer commonly used in traditional self-attention mechanisms. Table 5 demonstrates that MCFA outperforms the linear embedding layer regarding PSNR metrics while utilizing fewer parameters. Specifically, it achieves a 0.22dB improvement on the Set5 dataset and a 0.4dB enhancement on the Manga109 dataset. These results verify that MCFA effectively integrates adjacent, extended regional, and contextual information, providing optimal feature guidance for the linear transformer.

Effectiveness of joint feature-guided linear attention (JGLA): We compare JGLA with the self-attention mechanism prevalent in traditional transformers. According to the data in Table 5, JGLA achieves superior metrics while maintaining a consistent parameter count. Specifically, it outperforms the traditional mechanism by 0.04dB on the Urban100 dataset and by 0.21dB on the Manga109 dataset. Given that JGLA is a form of linear attention, it operates with significantly lower complexity compared to the traditional self-attention mechanism. These findings demonstrate that JGLA not only works with linear complexity but also delivers performance superior to that of the traditional self-attention mechanism, establishing its suitability for the SISR task.

To further ascertain the specific regions emphasized by joint feature-guided linear attention (JGLA) within the network, a detailed visual analysis was conducted. Figure 7b contrasts the focal areas of features when integrating JGLA and when omitting it. Without JGLA, the model predominantly targets the contour information of the butterfly. With the introduction of JGLA, the network accentuates not only texture features but also the surrounding details and contextual information adjacent to the contour. These observations underscore the capacity of JGLA to consolidate surrounding details and contextual information, facilitating the reconstruction process and enhancing the effectiveness of joint feature-guided linear attention (JGLA).

4.4.3 The effectiveness of CNN layer and linear transformer layer

In this subsection, we undertake ablation experiments focusing on the CNN layers and the linear transformer layers. Models comprising exclusively CNN layers and solely linear transformer layers form the basis of this analysis. The data presented in Table 6 illustrate that dependence solely on CNN layers increases parameter usage, while exclusive reliance on linear transformers leads to higher GPU consumption. Employing either component in isolation falls short of matching the performance level of JGLTN, rendering them less viable for practical applications. For instance, on the Manga109 dataset, models utilizing only CNN or linear transformer layers underperform JGLTN by 0.62dB and 0.26dB, respectively. Therefore, a balanced integration of CNN and linear transformer layers optimizes model size, GPU consumption, and performance, enhancing suitability for real-world deployment.

Table 7 Comparison with SISR methods on RealSR

4.5 Real-world image super-resolution

To further validate whether our model is applicable to the super-resolution of real images, we compared it with other lightweight models on the RealSR [60] dataset, including IMDN [17], LP-KPN [60], and ESRT [8]. Notably, LP-KPN is specifically designed for SR of real images. As shown in Table 7, our model outperforms the other methods on the RealSR dataset in terms of metrics, demonstrating that JGLTN is also suitable for SR of real images.

Fig. 8 Model performance and size comparison on Set5 (\(\times\)4)

Fig. 9 Model performance and Multi-Adds comparison on Set5 (\(\times\)4)

4.6 Model size analysis

In this section, we benchmark our model against others, considering PSNR, parameters, and Multi-Adds. We carry out these experiments on the Set5 dataset at the \(\times 4\) upscaling factor. As depicted in Fig. 8, while JGLTN does not have the lowest parameter count, it excels in the PSNR metric. Specifically, our model has fewer parameters than NGswin and A2N and marginally more than FMEN and LatticeNet-CL, yet it significantly surpasses these advanced methods in terms of PSNR. According to Fig. 9, JGLTN establishes an exemplary balance between PSNR and Multi-Adds, upholding superior PSNR performance with minimal Multi-Adds. Our proposed method outperforms other models within similar parameter ranges, achieving a judicious balance between model complexity and performance. Therefore, we affirm that our approach is lightweight and efficient, ensuring a favorable balance between model size and performance.

In the experiments, we set the number of CNN layers and linear transformer layers to 8 and explore the performance under conditions with fewer CNN layers and linear transformer layers. As illustrated in Fig. 10, when the number of CNN layers and linear transformer layers is increased, performance is consistently enhanced compared to the model variant with G=2 (where G represents the number of CNN layers and linear transformer layers). Consequently, JGLTN exhibits superior performance across diverse configurations, showcasing its effectiveness and scalability.

Fig. 10 PSNR improvement of JGLTN variants with different numbers of CNN layers and linear transformer layers (G) over the smallest JGLTN (G=2) for \(\times\)4 SR

5 Conclusions

In this paper, we propose a lightweight super-resolution network termed the joint feature-guided linear transformer and CNN network (JGLTN). The network consists of cascaded stages, each composed of a CNN layer and a linear transformer layer. The CNN layer incorporates an inter-scale feature integration module (IFIM), and the linear transformer layer encompasses joint feature-guided linear attention (JGLA). Specifically, IFIM aims to extract valuable feature information, while JGLA, integrated with the multi-level contextual feature aggregation (MCFA), brings together adjacent, extended regional, and contextual features to guide the linear attention. Regarding linear attention, we revisit the inter-feature similarity calculation and reduce the quadratic computational complexity of self-attention to linear complexity. A wide range of experiments shows that the JGLTN network strikes an impressive balance between performance and computational cost. Future work will involve an in-depth exploration of the intrinsic mechanisms of the JGLTN network to identify more precise feature extraction methodologies and more efficient computational strategies, further enhancing its capacity for SISR tasks.