1 Introduction

Single image super-resolution (SISR) is a fundamental low-level vision task, which aims to generate a high-resolution (HR) image from its low-resolution (LR) counterpart. SISR is an ill-posed problem since multiple HR images can map to the same LR input. Hence, plenty of image SR approaches have been proposed to tackle this inverse problem, ranging from early interpolation-based methods [1, 2] to the latest deep learning-based methods [3,4,5,6,7,8,9,10,11].

Benefiting from the powerful feature representation and end-to-end training paradigm of convolutional neural networks (CNNs), a flurry of CNN-based SISR methods have been presented to learn a mapping from the LR input image to its corresponding HR output, obtaining impressive improvements over conventional approaches. As a pioneering work, Dong et al. introduced CNNs to the image SR field and proposed SRCNN [3] with three convolutional layers. To explore more high-level information, SAN [12] and HAN [13] incorporated attention mechanisms into SR methods to capture long-range feature interdependencies, achieving noticeable improvement. Later, SwinIR [14] combined the advantages of CNNs and the Transformer [15] to process large-scale images and model global relationships simultaneously, obtaining better performance with fewer parameters. More advanced and complex SISR methods [9, 11, 16,17,18] have been proposed to further promote the quality of the reconstructed image.

Although remarkable performance has been achieved by these image SR methods, they still have some limitations. Few of them exploit the spatial correlations of image features to obtain contextual information, resulting in unsatisfactory super-resolved outputs. In particular, the pixels of image features have different relations with each other, and exploring these contextual relations is helpful for more discriminative learning. Tracing back to early classical self-example studies [19, 20], they captured self-similar patterns across the whole image to provide relations between pixels in a global view, showing the significance of spatial relations in the image SR field. Thus, how to model feature spatial dependencies becomes a key issue.

Fig. 1

Visual comparisons for scaling factor \(\times \)3 on “img_028” from Urban100. Our proposed RGCN obtains better visual quality with sharper edges compared with other state-of-the-art SISR methods

In this paper, we propose a novel relation-consistency graph convolutional network (RGCN), which exploits spatial information to enhance feature representation. To be specific, a spatial graph attention (SGA) is proposed for spatial feature encoding, which dynamically models the feature relations with awareness of global information. We first calculate all-region similarities in the feature space and then update features based on these global similarities in SGA. In this way, each pixel feature is learned from the whole image representation through graph convolutional operations. We employ the Gram matrix to construct global correlations without introducing extra parameters. These correlations form the adjacency matrix in SGA, which updates pixel-wise features through all-region relationships. As shown in Fig. 1, due to the similarity and regularity of the floor texture, capturing long-range information provides more clues to recover finer image details. It is noteworthy that modeling the spatial relations between every pair of pixels usually consumes heavy computational resources, especially when the image size is large. We thereby embed a spatial pyramid pooling scheme into SGA to address this issue. The pyramid pooling constructs the spatial relationships between pixels and grid features, reducing the computational overhead.

In addition, [21] found that the commonly used pixel-wise loss (e.g., the \(\mathcal {L}_{1}\) loss) tends to produce over-smoothed results since it is oblivious to the semantic information of the image. The spatial relation is a natural and static characteristic of an image; that is, the super-resolved SR image should share similar semantic relations with its corresponding LR input. We hence introduce the relation-consistency loss to maintain the spatial relationships between low-level features and high-level features. Specifically, we minimize the discrepancy between the adjacency matrices of the first and last feature layers, which regularizes the model to retain consistent spatial relations after super-resolution. Overall, with the above designs, our network learns a more reasonable mapping for reconstructing images with finer details.

Our main contributions can be summarized as follows:

  • We propose a novel relation-consistency graph convolutional network (RGCN) to enhance the learning ability through contextual information modeling, thus recovering images with finer details.

  • We propose a spatial graph attention (SGA) to encode feature spatial correlations. Within SGA, the adjacency matrix is calculated by the Gram matrix without learnable parameters. Meanwhile, a spatial pyramid pooling scheme is embedded into SGA to reduce computational costs.

  • We propose a simple yet effective loss term, namely relation-consistency loss, to maintain the consistency of spatial information between the LR input image and its corresponding SR output.

  • Extensive experiments on various degradation models demonstrate the superiority of our RGCN in terms of quantitative metrics and visual quality.

The rest of this paper is organized as follows. In Sect. 2, we review related work on CNN-based SISR methods and describe related approaches based on attention mechanisms and graph convolutional networks. In Sect. 3, we introduce our proposed SR network in detail. To verify the effectiveness of our method, experimental comparisons and evaluations are presented in Sect. 4. Finally, we conclude our work in Sect. 5.

2 Related work

2.1 CNN-based SISR methods

Recently, deep convolutional neural networks (CNNs) have been extensively studied in various computer vision communities. The powerful representational ability and end-to-end training paradigm of CNNs make them widely used in the SISR field. The pioneering work was done by Dong et al., who proposed a shallow convolutional network (SRCNN) [3] to predict the non-linear relationship between the interpolated LR image and the HR image, achieving considerable improvement over traditional methods. Later, Kim et al. designed the deeper networks VDSR [5] and DRCN [22] to capture more high-level information based on residual learning [23] and recursive learning. To control the number of model parameters and maintain persistent memory, Tai et al. introduced DRRN [24] with a novel recursive block and further designed MemNet [25] with memory blocks and dense connections.

For the methods described above, the LR images need to be interpolated to coarse HR images of the desired size, which inevitably increases the computational cost and produces side effects (e.g., noise amplification and blurring). To overcome these drawbacks, the post-upsampling architecture was proposed and soon became the mainstream framework in the image SR task. Lim et al. introduced a very deep and wide network, EDSR [26], by stacking simplified residual blocks in which unnecessary layers are removed. Similarly, Zhang et al. proposed a residual dense network (RDN) [8] to facilitate effective feature learning through a continuous memory mechanism. Li et al. built SRFBN [6], which utilizes a recurrent neural network and a feedback mechanism to refine low-level information with high-level image details. Vassilo et al. [27] incorporated multi-agent reinforcement learning and proposed an ensemble GAN-based SR network to increase the quality of the reconstructed image. Fang et al. [18] introduced an accurate and efficient soft-edge assisted network, which employs image prior knowledge in the network for better image reconstruction. Furthermore, Niu et al. [28] proposed CSN with an efficient channel segregation block that enlarges the receptive field to capture informative features, thus promoting the quality of the super-resolved image.

Compared to these CNN-based methods constrained by local relations, in this work we adopt a spatial graph attention mechanism to model contextual information in the spatial dimension.

2.2 Attention mechanism

The attention mechanism was initially proposed by Sutskever et al. for machine translation [29] by assigning different weights to the input. Coupled with deep networks, the attention mechanism has gained popularity in a variety of high-level vision tasks [30,31,32]. Hu et al. proposed the “squeeze-and-excitation” (SE) [33] block to enhance the learning ability by modeling channel-wise inter-dependencies. Woo et al. introduced the convolutional block attention module (CBAM) [34], which captures feature relations along the channel and spatial dimensions, respectively. Recent image SR studies built on attention mechanisms have shown remarkable performance gains. Zhang et al. integrated the SE block into residual learning and established the deeper network RCAN [35], whose channel-wise attention utilizes global average pooling to selectively highlight channel maps. Hu et al. presented the CSFM [36] network, which combines channel-wise and spatial attention to construct feature dependencies and enhance the quality of the output HR images. Besides, Dai et al. proposed the second-order attention network (SAN) [12] to exploit more powerful feature expressions by using second-order feature statistics. A recent SR approach, HAN [13], proposed a layer attention module to model the relationships among features, enabling the network to produce high-quality images. Later, SwinIR [14] utilized several residual Swin Transformer blocks to extract deep features, obtaining impressive performance with fewer parameters on various low-level vision tasks. To reduce computational cost while maintaining reconstruction performance, Mei et al. [11] combined sparse feature representation with non-local operations to capture long-range dependencies.

Although the above attention-based SR methods attain noticeable performance, they pay little attention to modeling feature spatial relations. In our work, we aim to model contextual dependencies via graph convolution operations.

2.3 Graph convolutional network

The concept of the graph neural network (GNN) [37] was first proposed by Gori et al. to handle graph-structured non-Euclidean data. A GNN collectively aggregates node features in a graph and properly embeds the graph in a new discriminative space. However, for regular Euclidean data such as images and text, it is hard to apply GNNs straightforwardly [38]. Therefore, defining a convolution-like operation for regularly structured data is a major challenge. Graph convolutional networks (GCNs) provide an effective solution to this problem. Bruna et al. [39] developed a “graph convolution” operation based on spectral properties, which convolves over the neighborhood of every graph node and produces a node-level output, but leads to expensive computational costs. After that, a flurry of graph convolutional studies have been presented. Kipf et al. [40] introduced a fast approximate localized convolution for graph-based semi-supervised classification, which not only simplifies the convolution operation but also alleviates overfitting. Li et al. [41] designed a residual graph convolutional broad network for emotion recognition, which extracts features and abstract features via GCN-based residual blocks; it not only improves the performance of the network but also extracts higher-level information. In addition, based on the attention mechanism, Wei et al. [42] constructed a cascade framework between graph convolutional layers via dense connections, further enhancing the graph representation capability.

Leveraging the properties of graph convolution, we propose a spatial graph attention mechanism to exploit the global relations of image features. Instead of directly modeling the pairwise relationships between all features, we further embed a pyramid pooling scheme in the graph convolutional operation, which effectively reduces the computational cost.

3 Proposed method

In this section, we first introduce an overview of our proposed network for image SR. We then describe the details of the designed spatial graph attention and relation-consistency loss, which are the core of our network.

3.1 Network architecture

The overall architecture of our relation-consistency graph convolutional network (RGCN) is shown in Fig. 2. Given a low-resolution (LR) image \(I_{\text {LR}}\) as input, our RGCN outputs the corresponding super-resolved (SR) image \(I_{\text {SR}}\). As explored in [12, 35], we first use a convolutional layer to extract the shallow feature \(F_{0}\) from the initial LR input

$$\begin{aligned} F_{0}=\mathcal {H}_{\text {SF}}(I_{\text {LR}}), \end{aligned}$$
(1)

where \(\mathcal {H}_{\text {SF}}(\cdot )\) represents the convolution operation. \(F_{0}\) then serves as the input for a series of attention-based feature refinement modules (AFRMs). Supposing we have N stacked AFRMs, the output \(F_{n}\) of the n-th AFRM is formulated as

$$\begin{aligned} F_{n}=\mathcal {H}_{\text {AFRM}}(F_{n-1}), \end{aligned}$$
(2)

where \(\mathcal {H}_{\text {AFRM}}(\cdot )\) stands for the function of an AFRM. After obtaining informative features from the set of AFRMs, global feature fusion is further applied to extract the global feature \(F_{glo}\) by fusing the features from all AFRMs

$$\begin{aligned} F_{glo}=\mathcal {H}_{GFF}(F_{1}, \cdots , F_{N}), \end{aligned}$$
(3)

where \(\mathcal {H}_{\text {GFF}}(\cdot )\) denotes a convolutional layer with kernel size 1\(\times \)1 that aggregates features from all modules. We utilize global residual learning before the upscaling operation by

$$\begin{aligned} F_{\text {DF}}=F_{0}+F_{\text {glo}}, \end{aligned}$$
(4)

where \(F_{\text {DF}}\) denotes the obtained deep feature. Finally, \(F_{\text {DF}}\) is upscaled via the upsampler to generate the SR image \(I_{\text {SR}}\). Inspired by [43], we adopt a sub-pixel layer followed by one convolutional layer

$$\begin{aligned} I_{\text {SR}}=\mathcal {H}_{\uparrow }(F_{\text {DF}}), \end{aligned}$$
(5)

where \(\mathcal {H}_{\uparrow }(\cdot )\) stands for the operation of upsampler.
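To make the pipeline of Eqs. (1)–(5) concrete, the following is a minimal PyTorch sketch of the forward pass. It assumes the hyperparameters stated in Sect. 3.5 (N = 20 modules, C = 64 channels) and uses a placeholder AFRM, since the actual module is detailed in Sect. 3.2; the exact upsampler layout and layer widths are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

def AFRM(channels):
    # placeholder for the attention-based feature refinement module (Sect. 3.2)
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                         nn.ReLU(inplace=True))

class RGCN(nn.Module):
    """Minimal sketch of Eqs. (1)-(5): shallow feature, N AFRMs,
    global feature fusion, global residual, sub-pixel upsampler."""
    def __init__(self, n_modules=20, channels=64, scale=4):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)           # H_SF, Eq. (1)
        self.afrms = nn.ModuleList(AFRM(channels) for _ in range(n_modules))
        self.gff = nn.Conv2d(n_modules * channels, channels, 1)    # H_GFF, Eq. (3)
        self.tail = nn.Sequential(                                  # H_up, Eq. (5)
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, lr):
        f0 = self.head(lr)                      # shallow feature F_0
        feats, f = [], f0
        for afrm in self.afrms:                 # N stacked AFRMs, Eq. (2)
            f = afrm(f)
            feats.append(f)
        f_glo = self.gff(torch.cat(feats, dim=1))   # global feature fusion
        f_df = f0 + f_glo                       # global residual learning, Eq. (4)
        return self.tail(f_df)                  # super-resolved image I_SR

# usage: RGCN(scale=3)(torch.randn(1, 3, 48, 48)) -> tensor of shape (1, 3, 144, 144)
```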

Fig. 2

a Architecture of relation-consistency graph convolutional network (RGCN). b The attention-based feature refinement module (AFRM) contains two parts: feature extraction and two-stream attention. Within two-stream attention, the proposed spatial graph attention (SGA) focuses on modeling the relationships between any two pixels, which is the core of our proposed network

3.2 Attention-based feature refinement module

As shown in Fig. 2b, the attention-based feature refinement module (AFRM) contains two parts: an Inception-style feature extraction, and a two-stream attention.

3.2.1 Inception-style feature extraction

Several studies [44] have demonstrated that multi-scale features carry rich information, which is beneficial for accurate SR image reconstruction. To this end, we employ the well-known Inception module [45] in our AFRM as a multi-scale feature extractor and simplify its structure by retaining two convolution kernel sizes (i.e., 3\(\times \)3 and 5\(\times \)5). Besides, we leverage dense connections in the feature extraction to reuse features from preceding layers, as sketched below.
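The following is a minimal sketch of such a simplified Inception-style extractor, assuming the two branches halve the channel width before fusion and that the dense connection is realized by concatenating the block input; the widths and the 1×1 fusion layer are our assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class InceptionExtractor(nn.Module):
    """Two parallel branches (3x3 and 5x5); their outputs are concatenated
    together with the block input (dense connection) and fused by a 1x1 conv."""
    def __init__(self, channels=64):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels // 2, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels // 2, 5, padding=2)
        self.fuse = nn.Conv2d(channels * 2, channels, 1)   # branches + dense input
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.act(self.branch3(x)),
                       self.act(self.branch5(x)),
                       x], dim=1)          # dense connection re-uses preceding features
        return self.fuse(y)
```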

3.2.2 Two-stream attention

As shown in Fig. 2b, the two-stream attention is constructed from two parallel attention branches, aiming to model the feature dependencies in the channel and spatial dimensions, respectively. Within the two-stream attention, we separately learn the feature relations between channels (i.e., channel-wise attention) and between pixels (i.e., spatial graph attention) and then aggregate their outputs to strengthen the feature representation. Specifically, the channel-wise attention (CA) explores the inter-dependencies across feature channels following [33] to adaptively rescale each channel-wise feature, while the spatial graph attention (SGA) dynamically models feature relations with awareness of global information. To make full use of the information learned from the channel and spatial dimensions, we place CA and SGA in a parallel manner, as sketched below. Moreover, we investigate different arrangements (i.e., parallel and sequential) of CA and SGA in Sect. 4 and experimentally find that the parallel arrangement gives better results than the sequential one.
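A minimal sketch of the parallel two-stream attention follows, assuming an SE-style channel attention as in [33, 35] and a 1×1 convolution for fusing the two streams (consistent with Sect. 4.2.4); the `sga` argument is a stand-in for the module described in Sect. 3.3.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention with reduction ratio r, following [33]/[35]."""
    def __init__(self, channels=64, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))    # rescale each channel-wise feature

class TwoStreamAttention(nn.Module):
    """Parallel CA and SGA; outputs are concatenated and fused by a 1x1 conv."""
    def __init__(self, channels=64, sga=None):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sga = sga if sga is not None else nn.Identity()  # stand-in for SGA (Sect. 3.3)
        self.fuse = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.ca(x), self.sga(x)], dim=1))
```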

3.3 Spatial graph attention

In recent CNN-based SISR studies [4,5,6, 26], most works focus on deeper or wider architectural designs, and the feature dependencies in the spatial dimension are rarely explored, which limits the learning ability of the network. We therefore design a spatial graph attention (SGA) to build the spatial relationships of local features, which can also be regarded as complementary to the channel-wise attention. The SGA encodes the feature spatial relations according to their semantic associations, enhancing the discriminative representation for better image generation.

Given an input feature \(F^{e}_{n}\in \mathbb {R}^{C\times H\times W}\) with C channels of size \(H\times W\), the proposed SGA, shown in the green rectangle of Fig. 2b, is composed of a graph convolutional layer followed by a BatchNorm layer and a ReLU activation function. A matrix multiplication is further performed to obtain the output \(F^{s}_{n}\)

$$\begin{aligned} F^{s}_{n} = F^{e}_{n}[\sigma (BN(\mathcal {H}_{\text {GC}}(F^{e}_{n})))], \end{aligned}$$
(6)

where \(\sigma (\cdot )\), \(BN(\cdot )\) and \(\mathcal {H}_{GC}(\cdot )\) represent the function of ReLU, BatchNorm and graph convolutional layer, respectively.

3.3.1 Graph convolutional layer

In the image SR task, the standard convolution operation extracts features over local areas via a predefined filter size (typically 3\(\times \)3), neglecting the global information of features. On the other hand, graph convolution has been widely employed in recent works [40, 42, 46] and is capable of capturing the global similarity between image pixels in arbitrary areas. We thereby combine these two types of convolution and propose a graph convolutional layer (GCLayer) for feature correlation learning, as shown in Fig. 3a.

Unlike the classical convolution that operates on a local Euclidean structure, graph convolution tries to learn a function \(\mathcal {H}_{GC}(\cdot ,\cdot )\) by defining edges \(\mathcal {E}\) among nodes \(\mathcal {V}\) in a global graph \(\mathcal {G}\). Given a local feature \(F_{n}^{e}\in \mathbb {R}^{C\times H\times W}\) and an adjacency matrix \(A\in \mathbb {R}^{HW\times HW}\) calculated from the shallow feature \(F_{0}\), we first feed \(F_{n}^{e}\) into a standard convolutional layer to generate the feature \(F_{con}\). We then reshape \(F_{n}^{e}\) to \(\mathbb {R}^{C\times HW}\). The graph convolution operation is performed on the reshaped feature \(F_{n}^{e}\) and the adjacency matrix A, and a new feature \(F_{gra}\) is thus acquired by

$$\begin{aligned} F_{gra} = \mathcal {H}_{GC}(F_{n}^{e}, A) = \hat{A}F_{n}^{e}W, \end{aligned}$$
(7)

with

$$\begin{aligned} \hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}, \end{aligned}$$
(8)

where \(\tilde{A} = A + I_{n}\) is the adjacency matrix of graph \(\mathcal {G}\) with added self-loops, \(I_{n}\) is the identity matrix, \(\tilde{D}\) is a diagonal matrix whose entries are the row sums of \(\tilde{A}\), and W is a layer-specific trainable weight matrix.
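A minimal sketch of Eqs. (7) and (8) is given below, treating each pixel as a graph node; the (HW, C) node layout and the weight initialization are implementation assumptions rather than the reference code.

```python
import torch
import torch.nn as nn

def normalize_adjacency(a):
    """Eq. (8): A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    a_tilde = a + torch.eye(a.size(0), device=a.device)
    d_inv_sqrt = a_tilde.sum(dim=1).clamp(min=1e-8).rsqrt()
    return d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)

class GraphConv(nn.Module):
    """Eq. (7): F_gra = A_hat * F * W, with one node per pixel."""
    def __init__(self, channels=64):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels) * 0.01)

    def forward(self, feat, adj):
        # feat: (C, H, W) local feature; adj: (HW, HW) raw adjacency matrix
        c, h, w = feat.shape
        nodes = feat.reshape(c, h * w).t()                 # (HW, C) pixel nodes
        out = normalize_adjacency(adj) @ nodes @ self.weight
        return out.t().reshape(c, h, w)                    # back to (C, H, W)
```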

Fig. 3

Details of a graph convolutional layer and b graph convolution with an embedded pyramid pooling scheme. Compared to (a), the pyramid pooling is added before graph convolutional operation and adjacency matrix generation process, which decreases the computational complexity of matrix multiplication without sacrificing the overall performance

3.3.2 Adjacency matrix

The relationship between any two pixels is characterized by the adjacency matrix A, which enables the generated feature \(F_{gra}\) to contain information from nonlocal areas. It is shown in [40] that GCN-based methods propagate information based on the adjacency matrix, which describes the correlations between different nodes. As a result, constructing a proper adjacency matrix is critical for GCN-based methods. Rather than using a complicated nonlocal module [47] to model the global relations of features, we prefer to generate A via the Gram matrix. The Gram matrix is commonly used in neural style transfer to capture the summary statistics of an entire image and can be treated as a second-order statistic [48]. In our work, we calculate the Gram matrix from the shallow feature \(F_{0}\) as our adjacency matrix

$$\begin{aligned} A=<F_{l}^{T}, F_{l}>, \end{aligned}$$
(9)

where A is the inner product between the feature \(F_{l}\) and its transpose \(F_{l}^{T}\) at the l-th layer. We set \(l=0\) here, which corresponds to the shallow feature \(F_{0}\).

Several advantages are brought by the above operations: 1) The Gram matrix is calculated without learnable parameters, so it is easy to compute and reproduce. 2) The adjacency matrix can be acquired from an input image of arbitrary size. 3) Sharing the adjacency matrix across the network decreases the computational burden without a performance drop, as validated in Sect. 4.
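A sketch of Eq. (9) follows. Since the adjacency matrix has size HW×HW, the Gram matrix here is taken over pixel positions rather than channels, which matches the dimensions stated above; no learnable parameters are involved.

```python
import torch

def gram_adjacency(f0):
    """Eq. (9): pairwise inner products between all pixel features of the
    shallow feature F_0; parameter-free adjacency of size (HW, HW)."""
    c, h, w = f0.shape
    nodes = f0.reshape(c, h * w)        # (C, HW)
    return nodes.t() @ nodes            # (HW, HW) pixel-pixel similarities

# usage: gram_adjacency(torch.randn(64, 48, 48)) -> (2304, 2304) matrix
```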

3.3.3 Pyramid pooling scheme

Since the graph convolution models the relationship between any two pixels, it requires large GPU memory and expensive computation, especially when the image size is large. Considering this, we look for an efficient way to address this issue without sacrificing performance. By inspecting the computation of graph convolution, we can see that Eq. (7) involves two matrix multiplications. More importantly, the former multiplication dominates the computation, with a computational complexity of \(\mathcal {O}(CH^{2}W^{2})\). We conclude that the key to reducing the computational overhead lies in reducing H and W. We thus embed a pyramid pooling scheme into the graph convolutional layer.

The detailed process of pyramid pooling for an example feature is depicted in Fig. 4, in which several pooling layers are placed in parallel to produce pooled features with sizes of 1\(\times \)1, 3\(\times \)3, 6\(\times \)6 and 8\(\times \)8. The 8\(\times \)8 pooled features are omitted in the figure for brevity.

As shown in Fig. 3b, the pyramid pooling is applied to both the adjacency matrix branch and the feature \(F_{n}^{e}\). To be specific, the shallow feature \(F_{0}\) first passes through the pyramid pooling to generate the pooled feature \(F_{0,p}\) of size \(C\times S\), where S is the total number of sampled points in the pyramid pooling (i.e., \(S=1^2+3^2+6^2+8^2=110\)). The shallow feature \(F_{0}\) is then reshaped and transposed to \(\mathbb {R}^{HW\times C}\), and Eq. (9) is applied to calculate the adjacency matrix \(A_{p}\in \mathbb {R}^{HW\times S}\). Similarly, the feature \(F_{n}^{e}\) is fed to the pyramid pooling and reshaped, obtaining the pooled result \(F_{n,p}^{e}\in \mathbb {R}^{S\times C}\). Thus, the formulation of Eq. (7) is rewritten as

$$\begin{aligned} P_{gra} = \mathcal {H}_{GC}(F_{n,p}^{e}, A_{p}) = \hat{A}_{p}F_{n,p}^{e}W, \end{aligned}$$
(10)

where the output \(P_{gra}\in \mathbb {R}^{C\times H\times W}\) is kept the same size as \(F_{gra}\) in Eq. (7).

By virtue of the spatial pyramid pooling, the computational complexity of the former matrix multiplication in Eq. (7) is decreased to \(\mathcal {O}(CHWS)\), lower than the original \(\mathcal {O}(CH^{2}W^{2})\). In addition to reducing the computations, the spatial pyramid pooling is also parameter-free. Consequently, the pyramid pooling efficiently lowers the computational overhead and maintains the overall performance simultaneously, as demonstrated in Sect. 4.
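The sketch below illustrates the pooled variant of Eq. (10) with pooling sizes {1, 3, 6, 8} (S = 110). For a 48×48 patch with C = 64, the dominant multiplication shrinks from roughly 64·2304² ≈ 3.4×10⁸ to 64·2304·110 ≈ 1.6×10⁷ multiply–accumulates, about a 21× reduction; the adjacency normalization of Eq. (8) is omitted here for brevity, and the helper names are our own.

```python
import torch
import torch.nn.functional as F

POOL_SIZES = (1, 3, 6, 8)     # S = 1 + 9 + 36 + 64 = 110 sampled points

def pyramid_pool(feat, sizes=POOL_SIZES):
    """Pool a (C, H, W) feature into (C, S) by concatenating adaptive
    average pools over the pyramid grid sizes."""
    c = feat.size(0)
    pooled = [F.adaptive_avg_pool2d(feat.unsqueeze(0), s).reshape(c, -1)
              for s in sizes]
    return torch.cat(pooled, dim=1)                 # (C, S)

def pooled_graph_conv(feat, f0, weight):
    """Sketch of Eq. (10): adjacency A_p (HW x S) built from F_0 and its
    pooled version; graph convolution over the S pooled nodes keeps the
    output at full resolution while the cost drops to O(C*H*W*S)."""
    c, h, w = feat.shape
    nodes_full = f0.reshape(c, h * w).t()           # (HW, C)
    nodes_pool = pyramid_pool(f0).t()               # (S, C)
    adj_p = nodes_full @ nodes_pool.t()             # (HW, S), Eq. (9) on pooled nodes
    feat_pool = pyramid_pool(feat).t()              # (S, C) pooled input feature
    out = adj_p @ feat_pool @ weight                # (HW, C)
    return out.t().reshape(c, h, w)                 # P_gra, same size as F_gra
```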

Fig. 4

Detailed process of the pyramid pooling scheme. “Pooling1,” “Pooling3” and “Pooling6” denote the sizes of the pooling layers. In our model, we set the pooling sizes to \(\{1,3,6,8\}\). The pooling size of 8 is omitted for brevity

3.4 Relation-consistency loss

The pixel-wise loss (e.g., the \(\mathcal {L}_{1}\) loss) is generally used in most CNN-based SISR methods and aims to minimize the distance between the super-resolved result \(I_{SR}\) and the ground-truth image \(I_{HR}\). Although such a loss helps networks achieve higher scores, it only measures the discrepancy over the entire image at the pixel level and rarely considers differences at the semantic level, resulting in poor visual quality in image details.

Moreover, since SISR is an image-to-image task, the semantic relations of the input LR image and the reconstructed SR image should be similar. Ideally, according to the spatial invariance of the global relations, the features obtained at different levels share similar contextual relations during training. For example, when super-resolving a “face” image, the features from one “eye” should be highly related to those of the other eye and less correlated with the features from the “nose.” This kind of feature dependency does not easily change and is independent of distance, since it is an inherent characteristic of the image and can be regarded as prior knowledge. Besides modeling the spatial correlation with SGA, the coherence of feature relations throughout the entire network also needs to be considered for generating visually pleasing images. To this end, we propose a relation-consistency loss that enhances the visual quality by minimizing the discrepancy at the semantic level.

As described in Sect. 3.3, the Gram matrix in SGA captures global statistics across the entire image; we therefore implement the relation-consistency loss with the Gram matrix. Since the proposed loss encourages the spatial relations to be consistent among different layers, the selection of features from specific layers is a key point. In Table 1, comparative experiments are conducted in which the Gram matrix is generated from various layers of the network. The results show no apparent improvement from frequently calculating the Gram matrix at different layers compared to generating it only from the low-level feature. This phenomenon indicates the contextual relation consistency within an image, which exactly validates the motivation of our relation-consistency loss. Based on these findings, we use the relation-consistency loss to impose a constraint between the low-level feature and the high-level feature; the loss function can be formally given as

$$\begin{aligned} \mathcal {L}_{relation}=\left\| A^0_{p} - A^N_{p}\right\| _{1}, \end{aligned}$$
(11)

where \(A^0_{p}\) and \(A^N_{p}\) denote the corresponding adjacency matrices of features \(F_{0}\) and \(F_{N}\), respectively.

As a consequence, the relation-consistency loss encourages the network to reconstruct a more realistic image by maintaining the contextual relation consistency between low-level feature and high-level feature.
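A sketch of Eq. (11) follows, reusing the pyramid pooling sketched in Sect. 3.3.3 as `pool_fn`; wrapping the pooled adjacency computation in a small helper is our assumption, not the reference implementation.

```python
import torch.nn.functional as F

def relation_consistency_loss(f0, fn, pool_fn):
    """Eq. (11): L1 distance between the pooled adjacency matrices of the
    shallow feature F_0 and the last deep feature F_N."""
    def pooled_adjacency(feat):
        c, h, w = feat.shape
        return feat.reshape(c, h * w).t() @ pool_fn(feat)   # (HW, S)
    return F.l1_loss(pooled_adjacency(f0), pooled_adjacency(fn))
```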

3.5 Implementation details

3.5.1 Full objective

Similar to [6, 8, 35], we employ the \(\mathcal {L}_{1}\) loss to optimize the proposed network by minimizing the difference between the reconstructed image \(I_{\text {SR}}\) and the ground-truth image \(I_{\text {HR}}\). Given a training dataset with M image pairs \(\{I^{m}_{\text {LR}},I^{m}_{\text {HR}}\}_{m=1}^{M}\), the reconstruction loss is represented as

$$\begin{aligned} \mathcal {L}_{\text {rec}}=\frac{1}{M}\sum _{m=1}^{M}\left\| \mathcal {H}_{\text {RGCN}}(I^{m}_{\text {LR}};\theta )-I^{m}_{\text {HR}}\right\| _{1}, \end{aligned}$$
(12)

where \(\mathcal {H}_{\text {RGCN}}(\cdot )\) is the function of our proposed RGCN. The total loss function is expressed by

$$\begin{aligned} \mathcal {L}_{\text {total}} = \mathcal {L}_{\text {rec}}+\lambda \mathcal {L}_{\text {relation}}, \end{aligned}$$
(13)

where \(\lambda \) is the hyperparameter to control the weights of different losses. The performance of RGCN with different losses is compared in Sect. 4, which verifies the importance of each loss.
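Putting Eqs. (12) and (13) together, a training step might combine the two terms as sketched below (λ = 1 as in Sect. 3.5.2); this reuses the relation-consistency helper sketched in Sect. 3.4 and only illustrates the objective, not the full training loop.

```python
import torch.nn.functional as F

def total_loss(sr, hr, f0, fn, pool_fn, lam=1.0):
    """Eq. (13): L_total = L_rec + lambda * L_relation.
    Assumes relation_consistency_loss from the sketch in Sect. 3.4."""
    l_rec = F.l1_loss(sr, hr)                                   # Eq. (12), batch average
    return l_rec + lam * relation_consistency_loss(f0, fn, pool_fn)
```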

3.5.2 Training details

We set the AFRM number to \(N=20\) in our proposed SR network, and each AFRM has 64 filters (i.e., C=64). Within each AFRM, we use 3\(\times \)3 and 5\(\times \)5 convolutional layers for feature extraction. For the channel-wise attention in the two-stream attention, we adopt a 1\(\times \)1 convolutional layer with reduction ratio \(r=16\), similar to [35]. The hyperparameter \(\lambda \) in the loss function is set to 1, which brings a more stable training process and better results.

4 Experiments

4.1 Settings

4.1.1 Datasets and metrics

Timofte et al. [49] released the high-quality DIV2K dataset for image restoration tasks, which contains 800 training images, 100 validation images and 100 test images. Following [6, 8, 12], we use the DIV2K dataset as our training set. For the testing stage, we evaluate our SR model on five benchmark datasets: Set5 [50], Set14 [51], BSD100 [52], Urban100 [20], and Manga109 [53]. All SR results are evaluated with the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [54] metrics on the Y channel (i.e., luminance) of the transformed YCbCr space.

4.1.2 Degradation models

To fully demonstrate the effectiveness of our proposed RGCN, three degradation models are used to simulate LR images. The first is bicubic downsampling using the Matlab function imresize with the option bicubic (denoted BI for short). Similar to [8, 55], the second blurs the HR image with a 7\(\times \)7 Gaussian kernel of standard deviation 1.6; the blurred image is then downsampled with a scaling factor of \(\times \)3 (denoted BD). We finally produce LR images in a more challenging way: the HR image is first downscaled by bicubic interpolation with scaling factor \(\times \)3, and Gaussian noise with noise level 30 is then added to the downsampled image (denoted DN).
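The three degradation models can be simulated roughly as follows, assuming images are tensors in [0, 1]; note that the bicubic resizing here only approximates MATLAB's imresize, and the kernel construction is an assumption consistent with the 7×7, σ = 1.6 setting above.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=7, sigma=1.6):
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).repeat(3, 1, 1, 1)          # depthwise 7x7 kernel for RGB

def degrade(hr, mode="BI", scale=3, noise_level=30):
    """Sketch of the BI / BD / DN degradations for a (1, 3, H, W) tensor in [0, 1]."""
    x = hr
    if mode == "BD":                                  # 7x7 Gaussian blur, sigma 1.6
        x = F.conv2d(F.pad(x, (3, 3, 3, 3), mode="reflect"),
                     gaussian_kernel(), groups=3)
    x = F.interpolate(x, scale_factor=1 / scale,      # bicubic downsampling
                      mode="bicubic", align_corners=False)
    if mode == "DN":                                  # additive Gaussian noise, level 30
        x = (x + torch.randn_like(x) * noise_level / 255.0).clamp(0, 1)
    return x
```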

Table 1 The impact of adjacency matrix A computed from different locations

4.1.3 Training details

During training, data augmentation is performed by horizontal flipping and rotations of \(90^\circ \), \(180^\circ \) and \(270^\circ \). In each batch, 16 LR image patches of size 48\(\times \)48 are extracted as inputs, and 1,000 iterations of back-propagation constitute an epoch. Our model is trained with the AdamW optimizer [56] with \(\beta _{1}=0.9\), \(\beta _{2}=0.999\), and \(\epsilon =10^{-8}\). A cosine learning rate scheduler [57] is adopted with an initial learning rate of \(10^{-4}\). We implement our model in the PyTorch framework with Titan V GPUs.
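The optimizer and scheduler settings above translate to roughly the following PyTorch setup; the model stand-in and the total number of epochs are placeholders (the full training length is not restated here), so treat this as a configuration sketch only.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)            # stand-in for the full RGCN
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), eps=1e-8)
total_epochs = 1000                                      # illustrative value only
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_epochs)

for epoch in range(total_epochs):
    # ... 1,000 iterations of back-propagation per epoch (see text) ...
    scheduler.step()
```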

4.2 Ablation study

In this section, we investigate the effectiveness of different components in our proposed method, including attention-based feature refinement module (AFRM), pyramid pooling scheme and relation-consistency loss. All the comparative experiments are trained on DIV2K with scaling factor \(\times \)4 in 200 epochs and further tested on Set5.

4.2.1 Number of N

Fig. 5

Convergence analysis of RGCN with different number (N) of attention-based feature refinement modules (AFRM). The performance curves are plotted on DIV2K with scaling factor \(\times \)4 in 200 epochs

We first investigate the basic parameter of our network: the number of AFRMs (denoted N for short), which is directly related to the model size and overall performance. For a clear comparison, the performance of SRCNN [3] is set as a reference. The convergence curves of RGCN with different values of N are presented in Fig. 5. From the results, we observe that RGCN with larger N (i.e., N=20 and N=30) obtains better performance, mainly because the network goes deeper as more AFRMs are stacked. In contrast, RGCN with smaller N (i.e., N=5 and N=10) suffers some performance drop but still outperforms SRCNN. More importantly, when increasing N from 20 to 30, the capacity of the network grows (14.5 M \(\rightarrow \) 23.4 M) without an obvious performance improvement. To better trade off performance and model size, we thus adopt N=20 for our final RGCN model.

4.2.2 Graph convolutional layer

In order to evaluate the efficiency of graph convolutional layer in SGA, we conduct some comparative experiments, including the impact of sharing adjacency matrix and the embedded pyramid pooling scheme, respectively.

Adjacency Matrix: In our network, the adjacency matrix in graph convolution is calculated from the low-level feature \(F_{0}\) and shared across the whole network. To verify the efficacy of these strategies, we perform a comparative experiment on computing the adjacency matrix from different locations. Two networks are introduced: one calculates the adjacency matrix from the feature \(F_{0}\) and the other from each output of the AFRMs. Table 1 shows that computing the adjacency matrix from each AFRM output is time-consuming with no obvious performance gain. We conclude that frequently updating the adjacency matrix does not lead to higher performance; instead, reusing the adjacency matrix performs better. This intriguing finding could be due to the fact that the contextual relations of an image are consistent and spatially invariant, which exactly verifies the motivation of the relation-consistency loss in our RGCN. Considering the balance between efficacy and efficiency, we finally opt to calculate the adjacency matrix from the shallow feature and share it across the whole network to ensure the consistency of feature spatial information.

Table 2 The impact of different scale of pooling in SGA on Set5 with scaling factor \(\times \)4

Pyramid pooling scheme: As discussed in Sect. 3.3, the pyramid pooling aims to reduce the computational overhead of the vanilla graph convolution. We thus give a quantitative comparison to validate its contribution using the following metrics: FLOPs, GPU memory usage and PSNR, evaluated on Set5 with a 48\(\times \)48 input image patch. As shown in Table 2, we compare several settings of pyramid pooling, including pooling with a single scale and with multiple scales. The graph convolutional layer without pooling is set as the baseline.

In detail, when using single-scale pooling in SGA (e.g., pooling(3)), the FLOPs and GPU memory occupation are reduced to 31.25 G and 2,695 MB, respectively. When we further utilize multiple pooling sizes (e.g., pooling(1368)), we obtain performance comparable to the baseline while decreasing the computational overhead and GPU memory simultaneously. Moreover, when adopting a large pooling size (e.g., pooling(24)), although the computational resources are effectively reduced, the performance degrades. Thus, four-scale pooling is selected for SGA to decrease the computational cost.

Table 3 Comparative results achieved by our RGCN trained with different losses for scaling factor \(\times \)4 in 200 epochs

4.2.3 Relation-consistency loss

In the following, we conduct ablation experiments to verify the effectiveness of the proposed relation-consistency loss for training our RGCN. It can be observed from Table 3 that the PSNR value of our RGCN decreases from 32.41 dB to 32.28 dB if the network is trained without the relation-consistency loss. This is mainly because, with the reconstruction loss alone, the network only learns to minimize the pixel-level difference between the generated output and the ground truth, while neglecting the correspondence of relations between high-level and low-level features in a global view. When the relation-consistency loss is further employed, better performance is achieved with a 0.13 dB gain.

Table 4 Comparative results of different arrangement of CA and SGA on Set5 with an upscaling factor \(\times \)4
Table 5 Comparisons with other spatial attention on Set5 with scaling factor \(\times \)4
Fig. 6

Visualization results of the LAM on different SR approaches with scaling factor \(\times \)4. The LAM illustrates the contribution of each pixel in the selected image patch (the red box in HR image). The larger the red area, the more pixels are utilized in feature extraction

Table 6 Investigation of each component in RGCN
Table 7 Quantitative results with scaling factor \(\times \)2 on BI degradation model

4.2.4 Stream of CA and SGA

Since CA and SGA have different functions in our network, their placement affects the overall performance. We here explore the influence of different arrangements (i.e., parallel and sequential) of CA and SGA. As shown in Table 4, the parallel arrangement of CA and SGA yields better representations (PSNR = 32.41 dB) than the sequential one and introduces only 0.2 M additional parameters, brought by the 1\(\times \)1 convolutional layer for feature fusion. Therefore, utilizing both CA and SGA is crucial, and the best arrangement strategy further pushes the overall performance.

Table 8 Quantitative results with scaling factor \(\times \)3 on BI degradation model
Table 9 Quantitative results with scaling factor \(\times \)4 on BI degradation model

4.2.5 Comparisons with other spatial attention methods

In order to evaluate our spatial graph attention (SGA), we conduct comparative experiments with two related attention methods: nonlocal [47] and the spatial attention in CBAM [34]. These two attention methods are used to replace our SGA, and the network containing only channel-wise attention is set as the baseline. Training and testing settings are kept the same as for our RGCN for a fair comparison. The results are listed in Table 5. One can clearly see that all methods with spatial attention achieve higher performance than the baseline, which indicates their effectiveness for image SR. Compared with the two well-known spatial attention methods (+ Nonlocal and + CBAM), the performance of our network (+ SGA) is increased by 0.05 dB and 0.03 dB with 0.3 M and 0.4 M extra parameters, respectively.

Fig. 7

Visualization comparison of BI degradation on Urban100 and Manga109 with scaling factor \(\times \)4

Fig. 8

Visualization comparison of BI degradation on Set14 and BSD100 with scaling factor \(\times \)4

4.2.6 Visualization of local attribution maps

Recently, Gu et al. incorporated attribution analysis into image SR methods and proposed a novel attribution approach called the local attribution map (LAM) [59]. The goal of LAM is to find the input pixels that strongly influence the network outputs and to visualize them as attribution maps.

Figure 6 shows the LAM results of some representative image SR approaches, including CARN [60], EDSR [26], ESRGAN [61], RCAN [35] and SAN [12]. As can be seen, our RGCN involves more pixels with larger receptive fields, while CARN and EDSR only exploit limited information within a restricted region. This implies that our model can capture long-range feature dependencies to enrich the representational ability of the network, generating better super-resolved images.

4.2.7 Other components

As stated in Sect. 3, our RGCN mainly contains channel-wise attention (CA), spatial graph attention (SGA) and global feature fusion (GFF). We evaluate various combinations to verify the effectiveness of each component. Baseline refers to the network containing only the feature extraction with 3\(\times \)3 and 5\(\times \)5 convolutional layers, which has a similar size to our RGCN to ensure a fair comparison. As shown in Table 6, the baseline achieves relatively low performance, indicating that blindly stacking more layers cannot lead to better performance. Adding CA, SGA and GFF individually to the baseline, resulting in Case 1, Case 2 and Case 3, each improves the overall performance effectively.

Table 10 Quantitative results with BI and DN degradation models

We further equip CA and SGA simultaneously, leading to Case 4; the method obtains consistent improvements of 0.04 dB and 0.02 dB compared with Case 1 and Case 2. A similar phenomenon can be found when GFF is used to form our final network, Case 6, where the overall performance is boosted from 32.39 dB to 32.41 dB.

4.3 Results with BI degradation

Simulating LR images with a bicubic degradation (BI) model is widely used in image SR settings. For the BI degradation model, we compare our RGCN with 16 state-of-the-art SISR methods: SRCNN [3], VDSR [5], EDSR [26], NLRN [62], RCAN [35], RDN [8], SRFBN [6], SAN [12], USRNet [17], HAN [13], SRGAT [16], SCET [63], ESRT [64], LBNet [65] and SwinIR [14]. All quantitative results for the three scaling factors over the five benchmark datasets are reported in Tables 7, 8 and 9.

Fig. 9

Visual comparisons of a BD degradation and b DN degradation with scaling factor \(\times \)3

Table 11 Model size, inference time and performance comparison on Set5 with upscaling factor \(\times \)4

Compared with the CNN-based SR methods, our RGCN achieves the best performance on most benchmark datasets for all scaling factors. Note that on Set14 for scaling factor \(\times \)3, our RGCN and HAN [13] both obtain the best PSNR, while our SSIM value is higher than that of HAN, which indicates that our method reconstructs better results; the same phenomenon can be found on Urban100 with scaling factor \(\times \)4. However, all these CNN-based methods perform worse than the Transformer-based image SR approaches, demonstrating the strong representation ability of the Transformer. Although our method obtains superior performance to most CNN-based methods, there is still a large margin to the Transformer-based approaches. In future work, we will consider improving our method by combining graph convolution with the Transformer.

We also present visual comparisons on different benchmark datasets with scaling factor \(\times \)4 in Figs. 7 and 8, respectively. As shown in Fig. 7, “img_09” from Urban100 contains a large amount of structured texture. Some CNN-based SR methods, such as VDSR [5] and LapSRN [66], cannot recover clear edges and fine details from the LR image. The methods employing the \(\mathcal {L}_{1}\) loss (e.g., RCAN [35] and SAN [12]) generate over-smoothed results with fewer fine image details. In contrast, our RGCN reconstructs the HR result with clear structural information and textural details, such as the lines of the floor.

4.4 Results with BD and DN degradations

Following [8, 55], we also conduct comparisons on the more challenging BD and DN degradations. Our RGCN is compared with SRCNN [3], IRCNN_C [67], IRCNN_G [67], SRMD [55], RDN [8] and SRFBN [6]. All results for \(\times \)3 are listed in Table 10, from which we can observe that our network achieves better performance on all datasets. For qualitative comparison, we show the super-resolved results with BD degradation in Fig. 9a. One can see that, for BD degradation, most compared methods produce blurring artifacts, whereas our RGCN suppresses the blur and recovers texture information. For the DN degradation model, our network removes the noise of the corrupted images and recovers more details compared to other methods, as shown in Fig. 9b. These comparative results on the BD and DN degradation models demonstrate that our network adapts well to multiple degradation models.

4.5 Model size and inference time

Table 11 shows the comparison results in terms of performance, inference time and model size. PSNR results and inference time are evaluated on Set5 with upscaling factor \(\times \)4. Our RGCN outperforms CNN-based image SR networks (e.g., SAN [12], RCAN [35] and HAN [13]) in terms of performance and model parameters with faster inference time. Although SRGAT [16] is much smaller than RGCN, its performance is inferior. Moreover, the PSNR value of RGCN is slightly lower than that of the Transformer-based method; however, our model has a comparable model size and costs less inference time than SwinIR [14]. Thus, the experimental results imply that our RGCN achieves a good balance between model size and performance.

5 Conclusion

In this paper, we propose a relation-consistency graph convolutional network (RGCN) for accurate image SR, which captures contextual information in the spatial dimension. To be specific, we utilize a spatial graph attention (SGA) to dynamically model global dependencies via graph convolution. To capture the pairwise relationships of image features, the Gram matrix is adopted to calculate the adjacency matrix in SGA, which is then shared across the entire network. We further embed a pyramid pooling scheme into SGA to reduce the expensive computational cost and memory occupation without sacrificing the overall performance. Additionally, a relation-consistency loss is introduced, which constrains the spatial relationships between low-level and high-level features at the semantic level. Extensive experiments demonstrate the superiority of our RGCN in terms of quantitative metrics and visual quality.