Introduction

Magnetic resonance imaging (MRI) is a non-ionizing imaging technique that is crucial for diagnosing conditions such as brain tumors, intracranial infections, and other cerebral diseases. The quality of MRI images is affected by many factors, such as the signal-to-noise ratio (SNR) and spatial resolution. In clinical practice, the scanning slice thickness is usually increased to shorten the imaging time, meet SNR requirements, and reduce motion artifacts caused by movements of the scanned subject. However, increasing the slice thickness produces low-resolution (LR) MRI images, which hinders the precision of follow-up analyses and diagnoses. As a post-processing method, super-resolution (SR) can effectively enhance the resolution of MRI scans without upgrading or replacing existing hardware. However, the SR problem remains challenging because any LR-to-high-resolution (HR) mapping admits multiple solutions.

In recent years, deep learning, represented by convolutional neural networks (CNNs), has become increasingly attractive for solving the SR problem. CNNs use a series of convolution operations to automatically extract hierarchical features and optimize their parameters over numerous samples, achieving impressive performance in restoring the degraded sharpness, contrast, and texture of HR MRI images. Current efforts mainly focus on increasing network depth or width through various techniques to improve the ability to fit SR mapping functions. However, enlarging the network does not bring significant improvements in MRI image SR performance, because it keeps extracting redundant features at a heavy computational cost. In the diagnosis of certain diseases (e.g., glioma), it is difficult to extract all the necessary information from a single MRI modality while ensuring clinical accuracy and an appropriate examination intensity. Therefore, the complementary information of multi-modality MRI images is usually exploited to improve the diagnostic accuracy of such diseases.

In this paper, we propose a cross-modality reference and feature mutual-projection (CRFM) method to enhance the spatial resolution of MRI images. Our model receives two types of inputs: LR images of a certain MRI modality, from which SR images are generated, and HR volumes of another modality, which provide reference features for accurate SR. Specifically, the CRFM network uses cascaded residual channel attention (RCA) blocks to extract features from the LR inputs. In this process, we propose a feature mutual-projection (FMP) method that exploits the cross-scale similarity of the image to capture the internal correlations of repeated patterns in features at different scales. Moreover, we extract the gradients of the HR images of the referenced imaging modality and feed them into the FMP module, complementing the SR task with true external HR details. At the tail of the CRFM network, we upscale all feature maps and then fuse them with the mutually projected features and reference gradients to predict the missing HR details. In addition, cross-scale residual learning is adopted to facilitate parameter optimization. Extensive experiments show that our CRFM surpasses several existing 3D brain MRI image SR techniques.

The contributions of this paper are summarized as follows:

  • We propose a reference-based MRI image SR method that fully utilizes image gradients from a reference MRI modality. Moreover, for the case in which no reference exists, we propose a single-image SR method, the cross-scale feature transformer (CFT) network, which only uses the self-similarity of MRI images at different scales to reconstruct HR details.

  • We design a feature mutual-projection method based on cross-scale feature matching via a transformer, exploiting the self-similarity of MRI images; it can be flexibly inserted anywhere in the SR network.

  • We develop parallel channel and spatial attention to achieve efficient feature refinement and enhancement while producing HR details.

The remainder of this paper is organized as follows: the Related Work section reviews existing studies. The details of the proposed CRFM method are provided in Materials and Methods. The implementation details, ablation studies, and comparisons with state-of-the-art methods are presented in Results. Finally, the Conclusion section concludes this paper.

Related Work

To deal with the ill-posed resolution reconstruction problem, various techniques have been developed; they can be roughly grouped into interpolation-based, reconstruction-based [1, 2], and learning-based SR methods [3,4,5,6]. Among them, interpolation algorithms are easy to apply but may introduce blocking, ringing, and jagged artifacts. In contrast, reconstruction-based SR methods simulate the MRI acquisition process and introduce priors to improve image quality. As a data-driven technique, learning-based SR methods learn complicated LR-to-HR mappings from large numbers of training samples. Although reconstruction-based and traditional learning-based methods have made noteworthy progress in enhancing the resolution of MRI images, their insufficient use of additional information and limited representation capabilities make it difficult to solve the challenging MRI SR reconstruction problem.

Recently, deep learning has achieved impressive success in natural image SR, and related techniques have been introduced to address the MRI SR problem [7, 8]. CNN-based MRI SR methods fall into two types: single-image SR (SISR) and reference-based SR (RefSR). SISR methods focus on learning a spatial mapping function to restore HR images from a given LR acquisition. Dong et al. [9] first designed a three-layer reconstruction network (called SRCNN) to enhance the resolution of two-dimensional (2D) natural images. Then, Pham et al. [10] proposed a three-dimensional (3D) SRCNN model to produce HR 3D brain MRI images. It is well known that deep networks can enhance their representation ability by increasing their depth and width. However, such models may be difficult to optimize due to the vanishing or exploding gradient problem. To alleviate training difficulty, residual learning [11, 12] and dense connections [13, 14] have been widely applied in MRI image SR networks. Shi et al. [15] integrated global connections and local skips into a progressive wide residual network to reconstruct HR MRI slices. Similarly, Oktay et al. [16] and Giannakidis et al. [17] adopted residual learning to increase the spatial resolution of cardiac and brain MRI images, respectively.

In addition, some well-designed strategies have been developed to unlock the restoration capacity of MRI image SR networks, such as multi-scale learning [18], attention mechanisms [19, 20], generative adversarial networks [21, 22], and multi-branch networks [23]. Wang et al. [24] designed a 3D attention mechanism to make the network concentrate on meaningful features and regions that are more conducive to improving the resolution of MRI images. Wang et al. [25] constructed a convolution and deconvolution model to increase the resolution of 3D MRI scans, which uses convolution and deconvolution kernels in parallel to obtain different levels of features and enrich feature extraction. Zhao et al. [26] put forward a channel-splitting method that feeds features into two sub-networks with different information transmission capabilities; multiple channel-splitting and fusion operations fuse different levels of features to reconstruct 2D MRI slices.

Compared with SISR, RefSR introduces one or more known HR images as additional references to reconstruct high-frequency details. In general, the references contain objects, scenes, or textures similar to those in the LR images [27], for example, videos or images obtained from different viewpoints of the same scene. Zhang et al. [28] and Yang et al. [29] designed enumerated feature patch matching and fusion methods to introduce HR details from the referenced images into the SR process. Zheng et al. [30] developed an end-to-end SR neural network that combines optical flow-based warping and image synthesis to transfer high-frequency features from HR references. Zhang et al. [31] introduced a progressive feature alignment and selection module, which performs deliberate feature selection to align the reference image, thereby transferring reference features into the input features more accurately. Cao et al. [32] improved the deformable convolution technique to acquire relevant features from the surrounding areas of the reference image based on established correspondences; by aggregating these features together with pertinent textural details, they synthesize visually superior high-resolution (HR) images. Huang et al. [33] designed a lightweight RefSR module that harnesses the high-frequency information of a high-resolution reference image; this module is employed in an inverse degradation process to restore missing fine textures and details, thereby enhancing overall visual quality. Since RefSR can find more meaningful clues from the referenced objects, the quality of the SR images can be considerably enhanced.

Although CNN-based MRI image SR methods have evolved greatly, their potential has not been fully exploited, as the internal priors of LR images and the external priors of multi-modality MRI have been neglected. In fact, priors are essential for correctly recovering clear textures and edges, especially for deep learning methods that automatically extract image features. As a multi-parametric imaging technique, MRI can produce complementary multi-modality scans with different tissue contrasts, such as T1w, T2w, and FLAIR images, which have been widely used to diagnose and evaluate clinical diseases. Therefore, to improve the resolution of MRI images of a certain modality, information from other imaging modalities can serve as ideal references. In 2019, Pham et al. [11] fed images with different contrasts into a network to explore their impact on model performance; the results showed that both FLAIR and T2w images can help improve the resolution of T1w images. In 2021, Feng et al. [34] exploited the complementarity of multi-modality MRI images and proposed a one-stage non-progressive network and a two-stage progressive network based on residual progressive learning to solve the MRI super-resolution reconstruction problem. Sarasaen et al. [35] utilized the different tissue structure information of the brain and the longitudinal information of multi-modality data collected from different orientations to improve the performance of super-resolution networks. In 2022, Kang et al. [36] established an associative memory network between T1w and T2w images to learn high-frequency features from T1w images for T2w images at different scales. In 2023, Yang et al. [37] integrated a multi-contrast MRI observation model into a deep unfolding network framework, explicitly capturing and leveraging the complex relationships between different contrasts through an iterative optimization process for super-resolution reconstruction. Huang et al. [38] proposed a dual-cross attention multi-contrast super-resolution framework that captures and fuses shareable information across multi-contrast images by utilizing highly downsampled reference images. In 2024, Kang et al. [39] constructed an end-to-end mapping network for multi-resolution analysis, incorporating a low-frequency filtering module to avoid interference from redundant T1-weighted information while effectively guiding T2-weighted super-resolution reconstruction with informative T1-weighted data.

Materials and Methods

This section details our proposed CRFM method. Let \(F\left( \cdot \right)\) with parameters \(\theta\) represent the mapping function given by the CRFM network. The goal of \(F\left( \cdot \right)\) is to generate an estimation that is as similar as possible to the real HR MRI image \(\text {I}_{\text {HR}}\), given an input degraded counterpart \(\text {I}_{\text {LR}}\) of a specific imaging modality and a referenced MRI image \(\text {I}_{\text {Ref}}\) of another modality. The following parts present the overview and main components of the CRFM network.

Fig. 1
figure 1

The architecture of the CRFM network, where UFF and FMP denote the upsampling and feature fusion module and the feature mutual-projection module, respectively

Network Overview

The architecture of the CRFM network is outlined in Fig. 1. To generate SR images that approximate the ground-truth MRI image, we extract additional gradients from \(\text {I}_{\text {Ref}}\) and transfer them into the backbone of the CRFM network. The backbone focuses on extracting features from \(\text {I}_{\text {LR}}\) and fusing the reference feature maps.

To extract the initial feature maps \(\text {X}_0\) from \(\text {I}_{\text {LR}}\), we first adopt a convolution layer followed by an LReLU activation function. Then, we extend the RCA module in [40] to 3D space and replace the ReLU with LReLU. After that, n improved RCA modules are cascaded as the backbone to map the initial features \(\text {X}_0\) to deep feature maps \(\{\text {X}_\text {1}, \text {X}_\text {2}, \dots ,\text {X}_{n}\}\), which are finally concatenated as \(\text {X}_{c}\).
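To make the backbone concrete, the following is a minimal PyTorch sketch of such a 3D RCA block; the number of convolutions, the channel-reduction ratio, and the LReLU slope are illustrative assumptions rather than the exact configuration of [40].

```python
import torch.nn as nn

class RCA3D(nn.Module):
    """Sketch of a 3D residual channel-attention (RCA) block with LReLU.
    Layer count, reduction ratio, and LReLU slope are assumptions."""
    def __init__(self, channels=48, reduction=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1),
        )
        # channel attention: global average pooling -> bottleneck -> sigmoid gate
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.body(x)
        y = y * self.ca(y)   # re-weight channels by their global statistics
        return x + y         # local residual connection
```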

In the backbone, a feature mutual-projection strategy is proposed to enhance meaningful texture features. Let \({F}_{{FMP}}\left( \cdot \right)\) denote the function represented by the proposed FMP module; the output of this function can be obtained as

$$\begin{aligned}{}[ \text {X}_{m}^{'},\text {Y}_{m}^{'} ] ={F}_{{FMP}}\left( \left[ \text {X}_{m},\text {X}_{\text {Ref}} \right] \right) \end{aligned}$$
(1)

where \(\text {X}_{\text {Ref}}\) refers to the reference features and m is the index of the RCA module. Here, FMP produces feature maps \(\text {Y}_{m}^{'}\) with high-frequency textures and the downsampled \(\text {X}_{m}^{'}\) from the corresponding \(\text {X}_m\) and \(\text {X}_{\text {Ref}}\).

Then, \(\text {X}_{m}^{'}\) is input into \(\text {RCA}_{m+1}\), allowing the CRFM network to explore more important cross-scale and cross-modality information. Following \(\text {RCA}_{n}\), the outputs of all RCA modules and \(\text {X}_{m}^{'}\) are concatenated along the channel direction. Finally, \(\text {X}_{\text {Ref}}\), \(\text {Y}_{m}^{'}\), and \([ \text {X}_1,\cdots ,\text {X}_{n},\text {X}_{m}^{'} ]\) are input into the upsampling and feature fusion (UFF) module to upsample the features and fuse the HR details, producing the output \(\text {Y}_f\). Similar to [25, 41], global cross-scale residual learning is utilized to improve the learning efficiency of the SR network. The final SR image is obtained as

$$\begin{aligned} \text {I}_{\text {SR}} =\text {Y}_{f}+\text {I}_{\text {LR}}^{\uparrow } \end{aligned}$$
(2)

where \(\text {I}_{\text {SR}}\) represents the desired estimation corresponding to the real HR MRI scan.
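To make the overall data flow concrete, a high-level PyTorch-style sketch is given below; it composes the RCA3D block sketched above with the FMP and UFF sketches in the following subsections. The single-channel input, the trilinear upsampling used for \(\text {I}_{\text {LR}}^{\uparrow }\), and the constructor signature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRFM(nn.Module):
    """High-level sketch of the CRFM forward pass (Eqs. 1-2); module internals
    are sketched separately. The RCA count, 48 channels, and the FMP position
    follow the implementation details; the remaining choices are assumptions."""
    def __init__(self, rca_blocks, fmp, uff, channels=48, scale=2, fmp_pos=5):
        super().__init__()
        self.head = nn.Sequential(nn.Conv3d(1, channels, 3, padding=1),
                                  nn.LeakyReLU(0.2, inplace=True))
        self.ref_conv = nn.Conv3d(1, channels, 3, padding=1)  # reference gradients -> X_Ref
        self.rcas = nn.ModuleList(rca_blocks)                 # n cascaded RCA modules
        self.fmp, self.uff = fmp, uff
        self.scale, self.fmp_pos = scale, fmp_pos

    def forward(self, lr, ref_grad):
        x_ref = self.ref_conv(ref_grad)          # reference features in HR space
        x = self.head(lr)                        # X_0
        feats, y_m = [], None
        for i, rca in enumerate(self.rcas, 1):
            x = rca(x)                           # X_i
            feats.append(x)
            if i == self.fmp_pos:                # Eq. (1): feature mutual-projection
                x, y_m = self.fmp(x, x_ref)
                feats.append(x)                  # X'_m joins the concatenation
        x_c = torch.cat(feats, dim=1)            # [X_1, ..., X_n, X'_m]
        y_f = self.uff(x_c, x_ref, y_m)          # upsample and fuse HR details
        lr_up = F.interpolate(lr, scale_factor=self.scale,
                              mode="trilinear", align_corners=False)
        return y_f + lr_up                       # Eq. (2): global cross-scale residual
```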

Cross-Modality Reference

MRI images of different modalities (such as T1w and T2w) have highly similar edges and structures, but their contrasts differ. This contrast difference causes information interference if the original \(\text {I}_{\text {Ref}}\) is fed directly into the RefSR network. Considering that gradients indicate the sharpness and structure of an image, we input the gradients of \(\text {I}_{\text {Ref}}\) into the backbone. The gradients of the reference HR image \(\text {I}_{\text {Ref}}\) are obtained as

$$\begin{aligned} \left\{ \begin{array}{l} {G}_{h}\left( \text {I}_{\text {Ref}}\right) =\text {I}_{\text {Ref}}\left( h+1,w,l \right) -\text {I}_{\text {Ref}}\left( h-1,w,l \right) \\ {G}_{w}\left( \text {I}_{\text {Ref}}\right) =\text {I}_{\text {Ref}}\left( h,w+1,l \right) -\text {I}_{\text {Ref}}\left( h,w-1,l\right) \\ {G}_{l}\left( \text {I}_{\text {Ref}}\right) =\text {I}_{\text {Ref}}\left( h,w,l+1 \right) -\text {I}_{\text {Ref}}\left( h,w,l-1\right) \\ \triangledown{G}\left( \text {I}_{\text {Ref}}\right) \, =\Vert \left( {G}_{h}\left( \text {I}_{\text {Ref}}\right) , {G}_{w}\left( \text {I}_{\text {Ref}}\right) , {G}_{l}\left( \text {I}_{\text {Ref}}\right) \right) \Vert _2 \end{array} \right. \end{aligned}$$
(3)

where \({G}_{h}\left( \cdot \right)\), \({G}_{w}\left( \cdot \right)\), and \({G}_{l}\left( \cdot \right)\) denote the gradient map extraction operations in the height, width, and length directions, respectively, and \(\triangledown {G}\left( \cdot \right)\) represents the operation of extracting the gradient strength. Then, a convolution layer is utilized to capture the structural dependency and spatial relationship between \(\text {I}_{\text {Ref}}\) and the corresponding output features \(\text {X}_{\text {Ref}}\). Since MRI is a natively multi-modal imaging technique, we can flexibly obtain abundant information from different modalities as available references for accurate MRI image SR. For example, we can introduce the gradients of T2w images as references when restoring HR T1w images.
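As a concrete illustration, the central differences and gradient strength of Eq. (3) can be computed as follows; the replication padding at the volume borders is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def reference_gradient(ref):
    """Gradient strength of a 3D reference volume following Eq. (3).
    `ref` has shape (H, W, L); border handling by replication is an assumption."""
    x = F.pad(ref[None, None], (1, 1, 1, 1, 1, 1), mode="replicate")[0, 0]
    g_h = x[2:, 1:-1, 1:-1] - x[:-2, 1:-1, 1:-1]   # difference along height
    g_w = x[1:-1, 2:, 1:-1] - x[1:-1, :-2, 1:-1]   # difference along width
    g_l = x[1:-1, 1:-1, 2:] - x[1:-1, 1:-1, :-2]   # difference along length
    return torch.sqrt(g_h ** 2 + g_w ** 2 + g_l ** 2)
```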

Fig. 2
figure 2

The architecture of the proposed FMP module, where \(\odot\) represents the inner product. The patches in \(q_i\) and \(k_j\) are matched according to their similarity, and the corresponding patches in \(v_j\) replace the LR patches in \(k_j\), finally producing the features in HR space

Feature Mutual-Projection

Rich relevant texture details at different scales are conducive to addressing the SR problem [42]; therefore, we propose a feature mutual-projection (i.e., FMP) method that mines meaningful textures by capturing the cross-scale and cross-modality self-similarity of MRI images. The detailed FMP process is shown in Fig. 2. In contrast to introducing information from external HR samples, as described in [43, 44], our FMP utilizes the internal self-similarity of 3D MRI images. Thus, our method reduces the interference of erroneous pathological information from external reference samples.

In the FMP module, mutual projection is applied to extract and combine feature maps at different scales. Given inputs \(\text {X}_{m}\) and \(\text {X}_{\text {Ref}}\), the FMP module outputs \(\text {X}_{m}^{'}\) and \(\text {Y}_{m}^{'}\) as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \text {Y}_{m}^{'}={F}_{\text {up}}\left( \text {X}_{m} \right) +\text {Y}_{m} \\ \text {X}_{m}^{'}={F}_{\text {down}}\left( \text {Y}_{m}^{'} +\text {X}_{\text {Ref}}\right) \end{array} \right. \end{aligned}$$
(4)

where \(\text {X}_{m}^{'}\) is the enhanced counterpart of \(\text {X}_{m}\), and \(\text {Y}_{m}^{'}\) refers to the features obtained from the cross-scale feature matching and transformer (CFMT) with the same size as the HR images. Here, \({F}_{\text {up}}\left( \cdot \right)\) and \({F}_{\text {down}}\left( \cdot \right)\) represent deconvolution upsampling and convolution downsampling operations with stride s, respectively. The mutual-projection manner allows the FMP module to effectively enhance the feature \(\text {X}_{m}\) according to the cross-scale dependencies in \(\text {Y}_{m}^{'}\) and the cross-modality self-similarity priors in \(\text {X}_{\text {Ref}}\). It is worth noting that the FMP module can be flexibly inserted between any two RCA modules in a plug-and-play manner.
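A minimal sketch of Eq. (4) is given below, assuming a scale factor of s=2 and 48-channel features; `cfmt` stands for a callable that encapsulates the embedding convolutions and the cross-scale feature matching and transfer described next, and the kernel sizes of \(F_{\text {up}}\)/\(F_{\text {down}}\) are assumptions.

```python
import torch.nn as nn

class FMP(nn.Module):
    """Sketch of feature mutual-projection (Eq. 4). `cfmt` returns the matched
    HR-space features Y_m from X_m."""
    def __init__(self, cfmt, channels=48, s=2):
        super().__init__()
        self.cfmt = cfmt
        self.up = nn.ConvTranspose3d(channels, channels, kernel_size=s, stride=s)  # F_up
        self.down = nn.Conv3d(channels, channels, kernel_size=s, stride=s)         # F_down

    def forward(self, x_m, x_ref):
        y_m = self.cfmt(x_m)                       # cross-scale matched HR features
        y_m_prime = self.up(x_m) + y_m             # project LR features into HR space
        x_m_prime = self.down(y_m_prime + x_ref)   # fuse reference, project back to LR
        return x_m_prime, y_m_prime
```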

As the main component of the FMP method, the CFMT aims to mine the self-similarity of MRI features at different scales. As displayed in Fig. 2, the input \(\text {X}_{m}\in \mathbb {R}^{\text {H}\times \text {W}\times \text {L}}\) is first downsampled to \(\text {X}_{m}^{\downarrow }\in \mathbb {R}^{\frac{\text {H}}{s}\times \frac{\text {W}}{s}\times \frac{\text {L}}{s}}\) by the Cubic interpolation method with a scale of s, thereby ensuring that the captured cross-scale dependencies correspond to the mapping between the LR and HR feature maps extracted by the CRFM network. In this way, internal image-specific exemplars can be mined to complement the external information captured from the training samples. Then, three convolution layers with \(1\times 1\times 1\) kernels extract the embedding features \(\text {X}^{\text {V}}\), \(\text {X}^{\text {Q}}\), and \(\text {X}^{\text {K}}\). Next, \(\text {X}^{\text {V}}\), \(\text {X}^{\text {Q}}\), and \(\text {X}^{\text {K}}\) are unfolded into patches \(v_j\), \(q_i\), and \(k_j\) with sizes of sp, p, and p and strides of sg, g, and g, respectively, where \(1\le i\le \lfloor \frac{\text {H}}{{g}} \rfloor \times \lfloor \frac{\text {W}}{{g}} \rfloor \times \lfloor \frac{\text {L}}{{g}} \rfloor\) and \(1\le j\le \lfloor \frac{\text {H}}{{sg}} \rfloor \times \lfloor \frac{\text {W}}{{sg}} \rfloor \times \lfloor \frac{\text {L}}{{sg}} \rfloor\). To extract the cross-scale dependencies between \(\text {X}^{\text {Q}}\) and \(\text {X}^{\text {K}}\), we calculate the similarity weight \(w_{i,j}\) for \(q_i\) and \(k_j\):

$$\begin{aligned} w_{i,j}=\frac{\exp \left(< q_i,k_{j}> \right) }{\sum _j{\exp \left( < q_i,k_{j} > \right) }} \end{aligned}$$
(5)

where \(< \cdot ,\cdot>\) denotes the inner product operation and the subscripts denote the coordinates of the weight w. The above unfolding operation and similarity calculation are implemented by convolution and softmax operations, where \(k_j\) serves as the kernel and \(q_i\) as the input.

To recover as many HR details as possible, \(w_{i, j}\) is assigned to the corresponding patch \(v_j\), which can be written as

$$\begin{aligned} v'_{i}=\sum _j{w_{i,j}\otimes v_j} \end{aligned}$$
(6)

where \(\otimes\) refers to the element-wise product operation. Then, \(v'_{i}\) is folded to obtain the feature \(\text {Y}_m\) of size \(s\text {H}\times s\text {W}\times s\text {L}\). The aforementioned weighted aggregation and folding operations are achieved by a deconvolution with kernel \(v_j\) and input w. Through this cross-scale feature matching and transfer operation, \(\text {Y}_m\) contains abundant HR features from patches at different scales.
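The matching and transfer of Eqs. (5) and (6) can be sketched as below; the batch dimension is omitted, the embedding convolutions are assumed to have been applied beforehand, the feature sizes are assumed divisible by s, and the averaging of overlapping HR patches is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def extract_patches_3d(x, size, stride):
    """Unfold a (C, H, W, L) tensor into patches of shape (N, C, size, size, size)."""
    c = x.shape[0]
    p = x.unfold(1, size, stride).unfold(2, size, stride).unfold(3, size, stride)
    return p.permute(1, 2, 3, 0, 4, 5, 6).reshape(-1, c, size, size, size)

def cfmt(x_q, x_k, x_v, p=3, g=2, s=2):
    """Sketch of cross-scale feature matching and transfer (Eqs. 5-6).
    x_q, x_v: embeddings of X_m with shape (C, H, W, L);
    x_k: embedding of the s-times downsampled X_m with shape (C, H/s, W/s, L/s)."""
    k = extract_patches_3d(x_k, p, g)              # key patches k_j used as conv kernels
    v = extract_patches_3d(x_v, s * p, s * g)      # value patches v_j used as deconv kernels
    sim = F.conv3d(x_q.unsqueeze(0), k, stride=g)  # inner products <q_i, k_j>
    w = torch.softmax(sim, dim=1)                  # Eq. (5): normalized over all j
    y = F.conv_transpose3d(w, v, stride=s * g)     # Eq. (6): paste weighted v_j patches
    # average the contributions of overlapping HR patches (assumption)
    ones_in = torch.ones(1, 1, *w.shape[2:], device=w.device)
    ones_k = torch.ones(1, 1, s * p, s * p, s * p, device=w.device)
    overlap = F.conv_transpose3d(ones_in, ones_k, stride=s * g)
    return (y / overlap.clamp(min=1.0)).squeeze(0)  # Y_m of size roughly (C, sH, sW, sL)
```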

Fig. 3
figure 3

The architecture of the proposed UFF module

Upsampling and Feature Fusion

To map the extracted features to HR space, we adopt an upsampling and feature fusion (UFF) module at the tail of the CRFM model. As shown in Fig. 3, the inputs to the UFF module include the reference feature \(\text {X}_\text {Ref}\), the feature set \(\text {X}_c\) produced by all improved RCA modules, and the output \(\text {Y}'_{m}\) of the FMP module. The sizes of \(\text {X}_\text {Ref}\) and \(\text {Y}'_{m}\) are the same as that of the desired HR estimation \(\text {I}_\text {SR}\), and the size of the features in \(\text {X}_c\) is equal to that of \(\text {I}_\text {LR}\). Specifically, we apply a 3D subpixel convolution layer [45] to upsample \(\text {X}_c\) to the target size. Then, the upsampled features, \(\text {X}_\text {Ref}\), and \(\text {Y}_{m}^{'}\) are fused via element-wise addition to produce new feature maps.

Although \(\text {Y}'_{m}\) contains rich high-frequency information that is beneficial to SR, it inevitably carries some useless repetitive information. In addition, there may be registration errors between \(\text {X}_\text {Ref}\) and the upsampled input image. To alleviate these undesirable effects, we exploit parallel spatial attention (SA) and channel attention (CA), which adaptively enhance meaningful features while suppressing irrelevant information. Here, the architectures of the CA and SA are the same as those in [40] and [46], respectively. We extended these CA and SA architectures to 3D space and used the LReLU activation function in the CA. The features refined by the CA and SA are concatenated and fed into a convolution layer without activation. Therefore, the UFF module produces the output \(\text {Y}_f\) through:

$$\begin{aligned} \text {Y}_f=F_{Conv}\left( \left[ \text {Y}_{CA},\text {Y}_{SA} \right] \right) \end{aligned}$$
(7)

where \(\text {Y}_{CA}\) and \(\text {Y}_{SA}\) are the CA and SA outputs, respectively. By comprehensively utilizing the inter-spatial and inter-channel relationships of the feature maps, the CRFM network can focus on informative feature regions and channels, thereby ensuring more efficient MRI image SR reconstruction.
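The fusion of Eq. (7) can be sketched as follows; the explicit voxel-shuffle implementation of the 3D subpixel upsampling, the spatial-attention kernel size, and the channel-reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

def voxel_shuffle(x, s):
    """3D analogue of pixel shuffle: (B, C*s^3, D, H, W) -> (B, C, sD, sH, sW)."""
    b, c, d, h, w = x.shape
    c_out = c // s ** 3
    x = x.view(b, c_out, s, s, s, d, h, w)
    return x.permute(0, 1, 5, 2, 6, 3, 7, 4).reshape(b, c_out, d * s, h * s, w * s)

class SpatialAttention3D(nn.Module):
    """Minimal 3D spatial attention in the spirit of [46]: channel-wise average
    and max maps gate the features; the kernel size is an assumption."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, k, padding=k // 2)

    def forward(self, x):
        gate = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(gate))

class UFF(nn.Module):
    """Sketch of the UFF module (Eq. 7): subpixel upsampling of X_c, element-wise
    fusion with X_Ref and Y'_m, then parallel CA and SA whose outputs are
    concatenated and mapped to the single-channel residual Y_f."""
    def __init__(self, in_channels, channels=48, s=2):
        super().__init__()
        self.up_conv = nn.Conv3d(in_channels, channels * s ** 3, 3, padding=1)
        self.s = s
        self.ca = nn.Sequential(                      # channel attention as in RCA3D
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // 8, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(channels // 8, channels, 1),
            nn.Sigmoid(),
        )
        self.sa = SpatialAttention3D()
        self.fuse = nn.Conv3d(2 * channels, 1, 3, padding=1)  # no activation

    def forward(self, x_c, x_ref, y_m):
        x = voxel_shuffle(self.up_conv(x_c), self.s) + x_ref + y_m
        y_ca = x * self.ca(x)                          # channel-attended branch
        y_sa = self.sa(x)                              # spatially attended branch
        return self.fuse(torch.cat([y_ca, y_sa], dim=1))   # Eq. (7)
```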

Loss Functions

When training the proposed CRFM network, we adopt the mean absolute error (MAE) with a regularization term as the loss function to minimize the reconstruction error between \(\text {I}_{\text {SR}}\) and \(\text {I}_{\text {HR}}\). Let \(L\left( \cdot \right)\) denote the objective function over \(\text {N}\) training pairs; it is defined as

$$\begin{aligned} L\left( \theta \right) =\frac{1}{\text {N}}\sum _{k=1}^{\text {N}}{\Vert {F}\left( \text {I}_{\text {LR}}^{k};\theta \right) -\text {I}_{\text {HR}}^{k} \Vert _1}+\lambda \Vert \theta \Vert _2 \end{aligned}$$
(8)

where \({F}\left( \cdot \right)\) refers to the aforementioned LR-to-HR reconstruction function represented by the CRFM network with parameters \(\theta\). Here, \(\lambda\) is set to \(1e{-6}\) to balance the loss term and the regularization term. Meanwhile, \(\theta\) is updated by the Adam optimizer [47] with a learning rate of \(1e{-4}\).
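In code, Eq. (8) amounts to a per-sample L1 reconstruction error plus a weighted L2 norm of the parameters; the following sketch, together with the commented Adam step, is illustrative (the model and data tensors are assumed to exist).

```python
import torch

def crfm_loss(sr, hr, params, lam=1e-6):
    """Sketch of Eq. (8): batch-averaged L1 reconstruction error plus
    lambda times the L2 norm of all network parameters."""
    l1 = (sr - hr).abs().flatten(start_dim=1).sum(dim=1).mean()
    l2 = torch.sqrt(sum(p.pow(2).sum() for p in params))
    return l1 + lam * l2

# illustrative training step (hypothetical model and batches):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = crfm_loss(model(lr, ref_grad), hr, model.parameters())
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```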

Results

Implementation Details

Following [11, 14, 48], we trained the CRFM network on the Kirby21 [49] dataset (KKI06-KKI42) and tested it on the Kirby21 (KKI01-KKI05) and BRATS2015 [50] datasets. As in [14, 51], we adopted a Gaussian kernel (\(\sigma =1\)) and Cubic downsampling to produce LR MRI images in the image domain. To imitate the acquisition of real MRI images [17, 26, 52], we also produced LR MRI images in k-space. Specifically, we used the fast Fourier transform to convert the original scans to k-space and truncated the data (setting part of the data to zero). Then, we applied the inverse fast Fourier transform to obtain spatial-domain data and finally produced LR images via Cubic downsampling. Before training, we cropped the LR inputs into \(26\times 26\times 26\) patches with a stride of 13. We evaluated the performance of the CRFM method using the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) [53].
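For reference, the two degradation pipelines can be sketched as follows; the centered k-space truncation window and the use of SciPy's spline-based zoom for Cubic downsampling are implementation assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def degrade_image_domain(hr, scale=2):
    """Image-domain degradation: Gaussian blur (sigma=1) then cubic downsampling."""
    return zoom(gaussian_filter(hr, sigma=1.0), 1.0 / scale, order=3)

def degrade_kspace(hr, scale=2):
    """k-space degradation: FFT, zero the outer (high-frequency) region of k-space,
    inverse FFT, then cubic downsampling (centered truncation window assumed)."""
    k = np.fft.fftshift(np.fft.fftn(hr))
    mask = np.zeros_like(k)
    ctr = [s // 2 for s in hr.shape]
    half = [s // (2 * scale) for s in hr.shape]
    mask[ctr[0] - half[0]:ctr[0] + half[0],
         ctr[1] - half[1]:ctr[1] + half[1],
         ctr[2] - half[2]:ctr[2] + half[2]] = 1.0
    truncated = np.fft.ifftn(np.fft.ifftshift(k * mask)).real
    return zoom(truncated, 1.0 / scale, order=3)
```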

In this paper, we used T2w and FLAIR MRI images as HR references and applied the CRFM network to the SR reconstruction of LR T1w MRI images. In practice, there may be small shifts between images of different modalities, which interfere with the SR task. To address this issue, we used the Cubic method to interpolate the LR T1w images to the target size and registered the reference volumes onto the corresponding interpolated T1w images.

In the CRFM network, all convolution layers have 48 channels with kernel sizes of \(3\times 3\times 3\). The number of RCA modules was set to 10, and the FMP module was inserted between \(\text {RCA}_5\) and \(\text {RCA}_6\). In the FMP module, the patch size p and stride g were set to 3 and 2, respectively. The parameters of the CRFM model were iteratively optimized for 100 epochs using PyTorch and Adam on an RTX 3090 GPU with a mini-batch size of 16. The MRI image SR performance was evaluated on the same equipment.

Table 1 Effects of the FMP position on SR performance and model parameters

Ablation Studies

This section discusses the influence of the main components and parameter settings of CRFM, including the insertion position and number of FMP modules, the patch size and stride of the CFMT, the number of channels and RCA modules, and the reference imaging modality. All models were trained with the Kirby21 dataset for 2\(\times\) SR in the image domain.

  1.

    Effects of the FMP position: We inserted the FMP module at three typical positions: the head (between \(\text {RCA}_1\) and \(\text {RCA}_2\)), the middle (between \(\text {RCA}_5\) and \(\text {RCA}_6\)), and the end (between \(\text {RCA}_9\) and \(\text {RCA}_{10}\)) to investigate the influence of the position and number of FMP modules on the SR results and model parameters. The effects are presented in Table 1, and we can see that inserting the FMP module at any position boosts the SR reconstruction outcome. The best balance between performance and efficiency was obtained by using one FMP module in the middle of the network. Although inserting multiple FMP modules shows slightly better performance, the number of parameters increases linearly (approximately 4.3 MB per FMP module). Considering SR performance and model complexity, we inserted only one FMP module in the middle of the CRFM network.

    Table 2 Effects of the stride g and patch size p in CFMT on SR performance and GPU memory
    Table 3 Comparative study on the number of RCA modules and the number of channels in each RCA module
  2.

    Effects of the CFMT parameters: Subsequently, we studied the influence of the patch size p and stride g in the CFMT. As shown in Table 2, we first fixed \(g=1\) to study the effect of p. The GPU consumption was obtained by reconstructing \(40\times 40\times 40\) LR patches. The PSNR and SSIM values in Table 2 indicate that the best performance was obtained when \(p=3\), which shows that small patches (\(p=3\)) serve better as regional descriptors. Comparing the GPU consumption of the models with \(p=3\) and \(p=5\) shows that the significant improvement in PSNR and SSIM outweighs the extra memory cost. Therefore, we set the patch size to \(p=3\). Then, we explored the effect of g by fixing \(p=3\). As seen from Table 2, slightly better results are obtained with \(g=1\) than with \(g=2\), but the GPU consumption is noticeably higher, by 3.48 GB. Although the GPU consumption of the network with \(g=2\) is slightly higher than that with \(g=3\), the PSNR and SSIM metrics are increased by margins of 0.09 dB and 0.0006, respectively. Finally, we set \(g=2\) to balance GPU consumption and SR performance.

  3.

    Effects of the number of RCA modules and channels: Here, we first studied the effect of the number of RCA modules with 48 fixed channels. As shown in Table 3, when the number of RCA modules was increased from 8 to 10, the PSNR and SSIM metrics were significantly improved, and the SR performance remained essentially constant when the number of RCA modules was further increased to 12. Then, we investigated the influence of the number of convolution channels with 10 RCA modules. When the RCA modules had 48 channels, our model achieved the best PSNR and SSIM values. Therefore, the numbers of RCA modules and channels were set to 10 and 48, respectively.

  4.

    Effects of the reference image modality: We also studied how to leverage HR reference images of different MRI modalities to promote SR accuracy. As presented in Table 4, transferring the gradients of the reference image is an effective way to improve SR performance. More specifically, when introducing the gradients of T2w images into the FMP module (i.e., C1), the PSNR and SSIM metrics were improved from 39.70 dB and 0.9847 to 39.80 dB and 0.9881, respectively. Similarly, fusing only the gradients of T2w images in the UFF module (i.e., C2) improved the PSNR and SSIM values to 39.78 dB and 0.9877, respectively. As expected, when the FMP and UFF modules (C1 & C2) simultaneously incorporate the gradients, the best performance is achieved, which indicates that referencing the gradients of HR images is beneficial for SR. Directly incorporating the original images also improved the SR results, but the PSNR and SSIM metrics are 0.08 dB and 0.0016 lower than when introducing the gradients. Although MRI images have cross-modality self-similarity, there are some differences in content. The image gradients reflect voxel-level changes and contain the high-frequency texture details missing from the degraded image; therefore, referencing them is conducive to increasing the resolution of the target-modality images. From Table 4, we also see that referencing T2w and FLAIR images has the same effect on improving the resolution of T1w images, demonstrating the robustness of the CRFM network.

    Table 4 Study on reference modality over Kirby21 dataset with scale 2. Here, C1 and C2 indicate incorporating reference features into FMP and UFF, respectively
    Table 5 Effects of channel and spatial attention in UFF. Here, the symbol & represents that CA and SA are used in parallel, and the outputs of them are connected through channel direction. CA-SA means that they are used in series
  5.

    Effects of CA and SA: In light of the aforementioned analysis, we discussed the influence of the channel and spatial attention in the UFF module on SR performance. Table 5 shows the results of different attention architectures in the UFF module. Both CA and SA facilitated SR reconstruction, and using them simultaneously yielded the most outstanding results. It is noteworthy that adopting CA and SA in parallel achieved the best SR performance, which is beneficial for improving the resolution of the acquired MRI images.

    Table 6 Quantitative results with standard deviations of MRI image SR methods under image-domain degradation. The best and second-best results are highlighted in bold and underline, respectively
    Table 7 The results of different methods on the BRATS2015 dataset degraded in the image domain

Comparisons with SOTA Methods

In this paper, we compared the proposed CRFM method with traditional methods (Cubic and NLM) and SOTA CNN-based SR techniques (SRCNN3D, ReCNN, EDDSR, and FASR) on the Kirby21 and BRATS2015 datasets. All the compared methods were implemented with the parameters and settings provided in the corresponding papers.

Quantitative Evaluation

Tables 6 and 7 present the quantitative results of 2\(\times\) and 3\(\times\) SR reconstruction in the image domain. Since FASR was trained in a generative adversarial manner, it has no advantage in terms of PSNR and SSIM. Therefore, we used FASR-L\(_1\), which is trained only with the L\(_1\) loss, for a fair comparison. Here, CFT denotes the non-reference version of the proposed CRFM network. As compared in Tables 6 and 7, among the non-reference SR methods, the CFT method achieved the best results on all datasets and scale factors. Furthermore, our CRFM method, which references the gradients of HR T2w images, outperformed all compared methods. This observation shows that referencing the features of HR images can effectively compensate for the missing mid- and high-frequency details of LR MRI images. On the Kirby21 dataset, the improvements in PSNR and SSIM (0.41 dB and 0.0049) of CRFM over FASR-L\(_1\) were significant at a scale of 2\(\times\). Meanwhile, CRFM obtained the highest PSNR of 36.12 dB and SSIM of 0.9664 at a scale of 3\(\times\). For the BRATS2015 dataset, collected from glioma patients, CRFM also obtained the best PSNR and SSIM results at both 2\(\times\) and 3\(\times\).

Tables 8 and 9 show the SR results of reconstructing images degraded in k-space. The proposed CRFM method showed a substantial advantage over the other methods, demonstrating stable SR performance under realistic degradation conditions. Similar to the image-domain degradation, the proposed method achieved SOTA results on all datasets and scale factors, both with and without reference images. In particular, on the Kirby21 dataset, the proposed CRFM outperformed FASR-L\(_1\), with PSNR improvements of 0.74 dB and 0.41 dB for 2\(\times\) and 3\(\times\) SR reconstruction, respectively. Furthermore, our CFT and CRFM networks produced more stable result distributions (smaller SDs) than the other deep learning SR methods. These results and comparisons demonstrate the superiority of our CRFM method over SOTA methods.

Table 8 Results of 2\(\times\) and 3\(\times\) SR reconstruction of MRI images under k-space degradation. The best and second-best results are highlighted in bold and underline, respectively
Table 9 The results of different methods on the BRATS2015 dataset degraded in the k-space
Fig. 4
figure 4

SR results of an MRI case (KKI02 from Kirby21) degraded in the image domain with an isotropic scale factor of 2\(\times\). The zoomed area (red arrow) illustrates that CRFM restored finer anatomical details and produced the best performance

Fig. 5
figure 5

SR results of an MRI image (KKI03 from Kirby21) degraded in the k-space with the isotropic scale factor 3\(\times\)

Fig. 6
figure 6

Comparisons on an MRI scan with glioma (T1c.36601 from BRATS2015) reconstructed by different SR techniques with the isotropic scale factor 2\(\times\) (degraded in k-space)

Visual Evaluation

Figures 4 and 5 provide visual comparisons of MRI images collected from healthy volunteers under spatial-domain and k-space degradation, respectively. The zoomed views of the restored images show that the proposed CFT and CRFM networks preserved more anatomical details than the other methods. Visual inspection shows that the reference-based CRFM network produced clearer SR images than the SISR techniques, demonstrating the effectiveness of embedding cross-modality image features in the SR task. Figure 6 visualizes a glioma scan (T1c.36601 in BRATS2015) reconstructed by different SR methods. The images produced by the compared methods were blurry, especially the one interpolated with the Cubic method. In contrast, our CFT and CRFM methods better recovered the glioma region and eliminated blurred edges (indicated by the red arrow) to a certain extent.

Discussion

SRFormer [55] employs permuted self-attention to efficiently establish relationships among pixel pairs within large windows, while DATSR [33] extracts texture features from reference images to supplement the detail information of low-resolution (LR) images. However, our application differs from these two methods. We propose a cross-modality reference and feature mutual-projection (CRFM) method that transfers high-resolution texture details from the reference modality to the target MRI image by incorporating cross-modality reference information; its feature mutual-projection mechanism captures internal correlations across different scales, further enhancing super-resolution performance. This novelty and practicality have significant implications for MRI image analysis and diagnosis. There are two main technical distinctions between SRFormer, DATSR, and our CRFM. First, in reference-based super-resolution of natural images, registration steps are typically crucial because of the inherent variations in viewpoint or environment among the source images. Conversely, in MRI super-resolution reconstruction, particularly when scans of different modalities depict the same anatomical structure of the same subject, rigorous spatial registration is not always necessary to harness high-resolution information from reference images of other modalities. The inherent correlation and consistency between MRI modalities, which provide complementary tissue contrast information of the same subject, make them particularly suitable for reference-based super-resolution. CRFM leverages this characteristic by merging the attributes of multi-modality MRI images, thereby significantly enhancing the resolution and diagnostic value of a specific MRI modality. Second, CRFM uniquely utilizes gradient information from high-resolution MRI images as a reference input and incorporates a feature mutual-projection (FMP) module designed to capture dependencies and similarity details across scales and modalities, a strategy not commonly found in single-modality super-resolution methods such as SRFormer. By mining such internal feature correlations, CRFM improves the accuracy of detail recovery during super-resolution reconstruction. Furthermore, the FMP module, rooted in cross-scale similarity, exploits the intrinsic interdependencies between MRI features at different scales, enabling more precise restoration of lesion regions and elimination of blurred edges, a level of refinement that the pure single-scale self-attention mechanism employed by SRFormer cannot match.

Conclusion

In this paper, we propose a cross-modality reference and feature mutual-projection (CRFM) method to increase the resolution of brain MRI images. Specifically, the CRFM network integrates reference-modality MRI images with global cross-scale self-similarity priors: gradients extracted from the reference image are used to mine potential external HR details. Meanwhile, we designed a mutual-projection feature enhancement method to capture cross-scale correlations across MRI features and effectively mine potential internal HR details. At the end of the CRFM network, parallel attention modules refine informative channels and feature regions. Extensive experiments on two publicly available MRI datasets demonstrate that CRFM significantly outperforms current state-of-the-art (SOTA) methods in super-resolution reconstruction. The method enables us to obtain high-quality brain scans with rich detail, which is poised to facilitate more accurate diagnoses and ultimately support clinicians in making more informed medical decisions.