1 Introduction

Image super resolution is a fundamental image processing task that aims to generate high resolution (HR) images from degraded low resolution (LR) images. In recent years, single image super resolution (SISR) methods based on deep convolutional neural networks (CNNs) have advanced significantly over conventional SISR models [1,2,3,4,5,6,7,8,9,10,11,12,13,14], and have been widely applied in fields such as medical imaging [15, 16] and satellite imaging [17]. However, most existing pre-trained SISR models can only restore LR images at a single fixed scale, so handling multiple scales consumes additional computing resources. In addition, the restriction of upsampling factors to integers limits their application in real-world scenarios.

To overcome these problems, several works have redesigned the upsampling network. Lim et al. [18] developed a multi-scale deep super resolution architecture (MDSR), which uses three upsampling branches (×2, ×3, ×4) to generate HR images of different sizes from a degraded image within the same model. To extend the scale factor to non-integer values, Hu et al. [19] proposed the magnification-arbitrary network (Meta-SR), a pioneering method for image reconstruction at arbitrary scales that uses several fully connected layers to predict the corresponding pixel values in the HR image. By using a local implicit image function (LIIF) to learn a continuous representation of HR images, Chen et al. [20] achieved attractive SR results, which not only eliminated the checkerboard artifacts of Meta-SR but also generated images at larger (×6, ×8) scales while maintaining considerable visual quality. Lee et al. [21] used two-dimensional (2D) Fourier space to form a local texture estimator (LTE); compared with the upsampling strategy of LIIF, this specific neural implicit function captures more image details. On the backbone side, Wang et al. [22] developed a dynamic scale-wise plug-in module (ArbSR) on top of existing SISR networks to perform image super resolution at arbitrary scales. Li et al. [23] proposed an enhanced dual-branch network (EDBNet), which mixes pixel embeddings and scale information in the upsampling network to generate arbitrary-scale SR images.

Compared with traditional single-scale upsampling modules, the arbitrary-scale upsampling networks above offer better adaptability and flexibility. There is no denying that ArbSR improves the backbone's ability to encode arbitrary-scale images with plug-and-play modules; however, it has a large number of parameters and slow inference. The other methods adjust only the upsampling module, and we believe this design can be further improved.

In this letter, we design a novel multi-scale cross-fusion network (MCNet), which achieves excellent performance on arbitrary-scale reconstruction. First, the scale-wise module (SWM) combines scale information and pixel features to effectively improve the representation capability of the backbone network for arbitrary-scale images. Moreover, we design a powerful multi-scale cross-fusion module (MSCF) after the backbone network to enrich the spatial information and remove redundant noise from the deep features. In MSCF, deep feature maps of different sizes learn interactively from each other. Experiments on four benchmark datasets demonstrate the highly competitive performance of our MCNet.

The main contributions of this letter are as follows: 1) We propose a novel multi-scale cross-fusion network (MCNet), which not only removes blurring artifacts for efficient and accurate image reconstruction but also delivers state-of-the-art results compared with other SR methods. 2) To further improve feature representation ability, we use the scale-wise module (SWM) to combine scale information with pixel features, effectively fusing two independent variables. 3) We design a multi-scale cross-fusion module (MSCF) after the backbone network, which consists of two basic components: a) a multi-downsampling convolution layer (MDConv), which uses convolutional layers of different kernel sizes to generate smaller feature maps, and b) a dual spatial mask (DSM), which learns interactive information from features at different scales.

Fig. 1 The network structure of our proposed MCNet, which contains three main parts: 1) feature extraction network, 2) cross-fusion module and 3) image reconstruction network

2 Proposed method

2.1 Outline

As shown in Fig. 1, our MCNet framework mainly consists of three parts: 1) the feature extraction network, 2) the multi-scale cross-fusion module (MSCF) and 3) the arbitrary-scale upsampling network.

First, the extracted feature map \(F_d\) is obtained by applying a \(3\times 3\) convolutional layer followed by an existing SISR backbone network to the input LR image X; i.e.,

$$\begin{aligned} F_d = E_\phi (Conv_{3\times 3}(X)) \end{aligned}$$
(1)

where \( E_\phi \) denotes the backbone network with multiple stacked residual blocks [18] and our novel SWM modules. We discuss this new module in more depth in the next section. The second part of the MCNet framework is our proposed MSCF, which plays a key role in generating the clean and abundant features \(F_{rid}\) given by

$$\begin{aligned} F_{rid} = Q(F_d) \end{aligned}$$
(2)

where \(Q(\cdot )\) will be described in more detail in a later section.
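For readers who prefer code, a minimal PyTorch sketch of Eqs. (1)-(2) follows; the module names (`backbone` for \(E_\phi\), `mscf` for \(Q\)) and the channel width are assumptions for illustration, not the released implementation.

```python
import torch.nn as nn

class MCNetEncoder(nn.Module):
    """Hedged sketch of Eqs. (1)-(2); `backbone` and `mscf` stand for E_phi and Q(.)."""

    def __init__(self, backbone, mscf, in_channels=3, channels=64):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)  # 3x3 conv
        self.backbone = backbone   # stacked residual blocks with SWM (E_phi)
        self.mscf = mscf           # multi-scale cross-fusion module (Q)

    def forward(self, x):
        f_d = self.backbone(self.head(x))   # Eq. (1): deep features F_d
        f_rid = self.mscf(f_d)              # Eq. (2): enriched features F_rid
        return f_rid
```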

In the upsampling network, we incorporate scale information for image reconstruction by adding a new SGU module to another branch, which tailors the image restoration task to our SR model. After the features are enriched, \( F_{rid} \) and its mapping coordinates C in HR image space are passed to the image upsampling network. Similar to LTE [21], an HR image Y is generated through a continuous image upsampling module with the local texture estimator \( G_{lte} \); i.e.,

$$\begin{aligned} Y = \sum _{i=1}^3{W_i \odot G_{lte}(F_i, C_i)} \end{aligned}$$
(3)

where i indexes the offset latent codes around \( F_{rid} \), \( C_i \) is the corresponding HR coordinate, and \( W_i \) is the corresponding weight of each coordinate.
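The aggregation in Eq. (3) can be read as a small weighted sum, as in the hedged sketch below; `g_lte`, the latent codes and the weights are placeholders for the quantities defined above, not the authors' exact implementation.

```python
import torch

def local_ensemble(g_lte, latents, coords, weights):
    """Hedged sketch of Eq. (3): weighted sum over the neighbouring latent codes.

    g_lte   -- callable standing in for the local texture estimator G_lte
    latents -- list of latent-code tensors F_i around F_rid
    coords  -- list of corresponding HR coordinates C_i
    weights -- list of weights W_i (assumed to sum to 1 per query pixel)
    """
    preds = [w * g_lte(f, c) for w, f, c in zip(weights, latents, coords)]
    return torch.stack(preds, dim=0).sum(dim=0)
```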

Consider a training set \( \{I^{LR}_i, I^{HR}_i\}^N_{i=1} \) that contains N LR-HR pairs, where \( I^{LR}_i \) is an input LR image and \( I^{HR}_i \) is the corresponding ground-truth (GT) image. We adopt the \( L_1 \) loss function to optimize our network during training:

$$\begin{aligned} \Theta ^* = \underset{\Theta }{\arg \min }\ \frac{1}{N} \sum _{i=1}^{N} \Vert \Omega (I^{LR}_i) - I^{HR}_i\Vert _1 \end{aligned}$$
(4)

where \( \Omega \) denotes our proposed model and \( \Theta \) is the set of its learnable parameters.
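For concreteness, a minimal sketch of this objective in PyTorch is given below; the call `model(lr_batch)` stands in for \( \Omega(I^{LR}_i) \), and averaging over image pixels as well as the batch is an assumption.

```python
import torch

def l1_objective(model, lr_batch, hr_batch):
    """Hedged sketch of Eq. (4): L1 distance between Omega(I_LR) and I_HR."""
    sr_batch = model(lr_batch)                          # Omega(I_LR)
    return torch.mean(torch.abs(sr_batch - hr_batch))   # mean absolute error over the batch
```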

Fig. 2 Architecture of the multi-scale cross-fusion module (MSCF)

2.2 Scale-Wise Module (SWM)

Inspired by the idea of multi-modal learning [24], a great deal of research has taken advantage of multi-modal fusion algorithms to combine two independent variables. For example, in image-text understanding, two independent variables such as an image and a text are fed into the backbone network to form two interleaving branches so that the corresponding information can communicate. A fusion module is designed between the two branches, which performs multi-modal learning on two completely different variables to establish a close relationship between them (Fig. 2).

On the basis of this prior knowledge, we design a plug-and-play module, called the scale-wise module (SWM), placed after each residual block of the EDSR [18] backbone network. Compared with ArbSR, this module requires less computation and fewer parameters, and it effectively combines the scale information with the image pixel features, substantially improving the ability of the backbone network to represent multi-scale images.

As shown in Fig. 3, we assume that F represents the pixel features of the image and S is the arbitrary-scale information, so that the working principle of the scale-wise module can be expressed as

$$\begin{aligned} W_{F S}=\delta _s\left[ f_1(F) \otimes f_2(S)\right] \end{aligned}$$
(5)

where \(\delta _s\) is the sigmoid activation function, \(f_k\ (k=1,2,3)\) represents different \(1\times 1\) convolution layers, and \(\otimes \) denotes matrix multiplication. \(W_{F S}\) is the pixel-scale weight matrix, obtained by letting the image pixel features and the scale information communicate and mapping the result into the range 0 to 1 through the activation function. We call this process pixel-scale attention. Then, the attention matrix \(W_{F S}\) is multiplied element-wise with the pixel features after the convolution layer \(f_3\), and a residual connection passes more useful spatial information to the next residual block of the EDSR [18]:

$$\begin{aligned} F_S=W_{F S} \cdot f_3(F)+f_3(F) \end{aligned}$$
(6)

It should be noted that the attention matrix \(W_{F S} \in R^{B \times 1 \times H W}\) while the pixel feature matrix \(f_3(F) \in R^{B \times C \times H W}\), so pixel-scale attention is a spatial attention mechanism that applies the same weights across all channel dimensions. The weight matrix \(W_{F S}\), obtained by multiplying the pixel and scale feature matrices, adapts to the scale information to further discriminate the whole image space. Finally, we reshape the tensor \(F_S\) from \({B \times C \times H W}\) to \({B \times C \times H \times W}\); this tensor serves as the input to the next residual block in the encoder.
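A minimal PyTorch sketch of Eqs. (5)-(6) is given below; it assumes the scale information S is a short per-sample vector and treats its \(1\times 1\) convolution as an equivalent linear layer. The shapes follow the description above, but the exact channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ScaleWiseModule(nn.Module):
    """Hedged sketch of SWM (Eqs. 5-6); channel/scale dimensions are assumptions."""

    def __init__(self, channels, scale_dim=2):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, kernel_size=1)  # pixel branch (1x1 conv)
        self.f2 = nn.Linear(scale_dim, channels)                # scale branch (1x1 conv on a scale vector)
        self.f3 = nn.Conv2d(channels, channels, kernel_size=1)  # value branch (1x1 conv)

    def forward(self, F, S):
        # F: (B, C, H, W) pixel features; S: (B, scale_dim) scale information
        B, C, H, W = F.shape
        pix = self.f1(F).view(B, C, H * W)            # (B, C, HW)
        scl = self.f2(S).view(B, 1, C)                # (B, 1, C)
        # Eq. (5): pixel-scale attention, W_FS in (B, 1, HW)
        W_FS = torch.sigmoid(torch.bmm(scl, pix))
        val = self.f3(F).view(B, C, H * W)            # (B, C, HW)
        # Eq. (6): spatial attention shared across channels plus a residual connection
        F_S = W_FS * val + val
        return F_S.view(B, C, H, W)                   # reshaped input for the next residual block
```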

Fig. 3 Architecture of the scale-wise module (SWM)

2.3 Multi-Scale Cross-Fusion module (MSCF)

To further improve the quality of the images reconstructed from the backbone features, we design a powerful module consisting of a multi-downsampling convolutional architecture (MDConv) and a dual spatial mask (DSM). As shown in Fig. 2, in the MDConv module a set of convolutional layers is used to downsample the deep features \( F_d \) delivered by the SR backbone network; that is,

$$\begin{aligned} F_{td}^{k} = Conv\downarrow _k(F_d^k) \end{aligned}$$
(7)

where \( k\ (k=\frac{1}{8}, \frac{1}{4}, \frac{1}{2}, 1)\) denotes the downsampling factor and \( F_{td}^{k} \) is the downsampled feature at that scale, which contains richer global information about the image. By interpolating in space and concatenating along the channel dimension, the generated features \( F_{td}^{k} \) are used to form the new feature maps \( F_{cd}^{k} \). Note that we use bilinear interpolation to resize feature maps of different scales to the same size. MDConv thus provides feature maps with different receptive fields and structural information for the next step. The multi-scale features \( F_{cd} \) are then fed into the dual spatial mask (DSM) sub-modules of our MSCF in succession as follows:

$$\begin{aligned} D^{\frac{1}{4}}, C^{\frac{1}{4}}= & {} DSM_1(F_{cd}^{\frac{1}{4}}, F_d^{\frac{1}{8}})\end{aligned}$$
(8)
$$\begin{aligned} D^{\frac{1}{2}}, C^{\frac{1}{2}}= & {} DSM_2(F_{cd}^{\frac{1}{2}}, C^{\frac{1}{4}})\nonumber \\ D^{1}= & {} DSM_3(F_{cd}^{1}, C^{\frac{1}{2}})\nonumber \\ F_{od}= & {} Coi(D^{\frac{1}{4}}, D^{\frac{1}{2}}, D^{1}) \end{aligned}$$
(9)

where \( D^k \) and \( C^k \) are the corresponding outputs of the DSM modules, and Coi denotes the corresponding interpolation and concatenation operations. The operator \( DSM_i(\cdot ) \) denotes our dual spatial mask (DSM), which learns attention weights from two feature maps of different scales; its detailed structure is as follows:

$$\begin{aligned} F= & {} F \cdot SM(C\uparrow ^2) + F\nonumber \\ C= & {} C \cdot SM(F) + C \end{aligned}$$
(10)

where F and C denote the two inputs of the mask module, \( \uparrow ^2 \) denotes \( \times 2 \) upsampling, and \( SM(\cdot ) \) is the spatial gate mechanism. Note that the two inputs of different sizes are adjusted to the same shape inside the DSM. \( D^k \) serves as part of the final output of MSCF, while \( C^k \) is used for interactive learning in the next sub-module; in this way, the two branches learn additional textures and structures from each other.
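The two sub-modules of MSCF can be sketched as follows. The strides, kernel sizes, and the construction of the spatial gate \( SM(\cdot) \) (a \(1\times 1\) convolution followed by a sigmoid) are assumptions consistent with Eqs. (7) and (10), not the authors' exact implementation.

```python
import torch.nn as nn
import torch.nn.functional as Fn

class MDConv(nn.Module):
    """Hedged sketch of MDConv (Eq. 7): strided convolutions produce 1, 1/2, 1/4, 1/8 scale maps."""

    def __init__(self, channels):
        super().__init__()
        self.down = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, stride=s, padding=1)
            for s in (1, 2, 4, 8)   # strides for scales 1, 1/2, 1/4, 1/8 (kernel sizes are assumptions)
        ])

    def forward(self, f_d):
        return [conv(f_d) for conv in self.down]   # list of F_td^k at decreasing resolutions

class DualSpatialMask(nn.Module):
    """Hedged sketch of DSM (Eq. 10) with an assumed 1x1-conv + sigmoid spatial gate SM(.)."""

    def __init__(self, channels):
        super().__init__()
        self.sm_c = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())  # mask built from C
        self.sm_f = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())  # mask built from F

    def forward(self, F, C):
        # F: larger-scale features; C: features at roughly half the spatial size of F
        C_up = Fn.interpolate(C, size=F.shape[-2:], mode='bilinear', align_corners=False)
        D = F * self.sm_c(C_up) + F                       # F refined by the mask of upsampled C
        mask_f = Fn.interpolate(self.sm_f(F), size=C.shape[-2:],
                                mode='bilinear', align_corners=False)  # mask of F resized to C
        C_out = C * mask_f + C                            # C refined by the mask of F
        return D, C_out
```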

Table 1 Quantitative results of state-of-the-art arbitrary-scale SR methods

3 Experiment results

3.1 Implementation details

Following the settings in EDSR [18], we train our MCNet on the DIV2K dataset. For testing, MCNet is evaluated on four standard benchmark datasets: Set5 [7], Set14 [25], B100 [26] and Urban100 [27]. During training, 16 degraded patches of size \(48\times 48\) form a batch. For the upsampling part, we sample random scale factors from the uniform distribution U(1, 4), so each example in a batch has a different upsampling target. The Adam [28] optimizer with \( \beta _1=0.9, \beta _2=0.999 \) is used to train MCNet for 1000 epochs. The learning rate is initialized to \( 1\times 10^{-4} \) and halved at epochs 200, 400, 600 and 800.
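A hedged sketch of this training schedule is given below; the model and data-loader interfaces are assumptions, while the optimizer, learning-rate schedule and loss follow the settings described above.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

def train_mcnet(model, train_loader, epochs=1000):
    """Hedged sketch of the training schedule; model/data-loader signatures are assumptions."""
    optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    scheduler = MultiStepLR(optimizer, milestones=[200, 400, 600, 800], gamma=0.5)
    for epoch in range(epochs):
        for lr_imgs, coords, hr_pixels in train_loader:       # batch with per-sample target coordinates
            sr_pixels = model(lr_imgs, coords)                # arbitrary-scale prediction (assumed API)
            loss = torch.mean(torch.abs(sr_pixels - hr_pixels))  # L1 loss, Eq. (4)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()   # learning rate halved at epochs 200, 400, 600 and 800
```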

Fig. 4 Qualitative comparison of different methods on the Urban100 dataset

Table 2 Memory usage and time consumption compared with other arbitrary-scale SR models for \( \times 2 \) upsampling
Table 3 Computer resource consumption compared with the ArbSR method for \( \times 2 \) upsampling

3.2 Performance evaluation

Six SOTA SR networks are compared with our proposed MCNet: EDSR [18], Meta-SR [19], ArbSR [22], LIIF [20], LTE [21] and EDBNet [23]. Table 1 reports the Peak Signal-to-Noise Ratio (PSNR) values on four benchmark datasets at upscaling factors from \(\times 2\) to \(\times 8\). Note that EDSR [18] is a single-scale image super-resolution model, so we only train and test it at the standard scales of \(\times 2\), \(\times 3\) and \(\times 4\). Our proposed MCNet significantly outperforms EDBNet [23] on the Urban100 dataset; in particular, the PSNR results show clear improvements at medium scales. Furthermore, a visual comparison is shown in Fig. 4. For the challenging details in “img044” and “img054”, most previous methods lose crucial details when restoring the images, whereas our MCNet recovers more of the fine structure. In addition, the cost comparison of four arbitrary-scale super-resolution models in Table 2 shows that MCNet requires only a small amount of additional computational resources. In summary, compared with other arbitrary-scale super-resolution methods, our model achieves the best image reconstruction performance, albeit with some extra computational cost.

3.3 Ablation study

To confirm the effectiveness of the scale-wise module (SWM), we compare it with ArbSR's Scale-Aware Feature Adaption (SAFA). Table 3 shows that SWM has far fewer parameters and consumes less computing resources. Moreover, SWM achieves better SR restoration performance than the SAFA module of ArbSR; the PSNR results are obtained on Urban100 with \( \times 4 \) upsampling. In short, our SWM outperforms SAFA while requiring very little additional resource consumption.

It is well known that when an image is repeatedly downsampled, the small-scale image contains more comprehensive global information and less noise. To generate cleaner and richer high-resolution images, we therefore design the Multi-Scale Cross-Fusion Module (MSCF) after the feature extraction network, which consists of the multi-downsampling convolutional architecture (MDConv) and the dual spatial mask (DSM). Through several downsampling steps, MDConv obtains feature maps of various sizes. The multi-scale texture and structure information in these feature maps provides an important basis for the subsequent learning in DSM, where the features interactively absorb semantic information from each other, so that the output high-dimensional feature maps contain clean and rich texture and structure information.

Table 4 shows the ablation experiments for DSM and MDConv. For the two variants, we validate their effectiveness on Urban100 with six scale factors ranging from 2 to 8. All networks are pre-trained on the EDSR [18] backbone for 1000 epochs. The suffixes -D and -M indicate that MCNet removes the DSM and MDConv components, respectively. Comparing MCNet with MCNet(-D), we observe that DSM brings further improvement, particularly at upsampling scales within the training distribution, which is consistent with our motivation. To confirm the effectiveness of MDConv, we also compare MCNet with MCNet(-M) and find that MDConv enhances the quality at both in-distribution and out-of-distribution scale factors.

Table 4 Ablation study of two variants on the Urban100 dataset

4 Conclusion

In this letter, we propose a novel multi-scale cross-fusion network (MCNet) that extends existing SISR networks to arbitrary scaling factors. The designed scale-wise module (SWM) integrates scale information and pixel features to effectively improve the representation of arbitrary-scale images. In addition, the multi-scale cross-fusion module (MSCF) removes redundant noise from deep feature maps and provides abundant spatial embeddings for subsequent image restoration. Comprehensive evaluation demonstrates that our MCNet achieves superior performance compared to state-of-the-art arbitrary-scale methods.