
1 Introduction

High-resolution magnetic resonance (MR) images (MRI) provide a wealth of structural detail, which facilitates early and precise diagnosis [1]. However, images obtained in clinical practice are often anisotropic due to limitations on scan time and signal-to-noise ratio [2]. To speed up clinical scanning procedures, only a limited number of two-dimensional (2D) slices are acquired, even though the anatomical structures of interest are three-dimensional (3D). The acquired images therefore have low inter-plane resolution, i.e., large spacing between slices. Such anisotropic images can lead to misdiagnosis and greatly degrade the performance of various clinical tasks, including computer-aided diagnosis and computer-assisted interventions. We therefore investigate the problem of reducing the slice spacing [3] via super-resolution (SR) reconstruction. Specifically, we refer to an image with large slice spacing as a low-resolution (LR) image and an image with small slice spacing as a high-resolution (HR) image. Our goal is to reconstruct the HR image from the LR input, which is an ill-posed inverse problem and presents significant challenges.

Deep learning-based algorithms for single MR image super-resolution show great potential in restoring HR images from LR inputs [4]. Pham et al. [5] applied the SRCNN method, a convolutional neural network (CNN) for image super-resolution, to MRI and achieved better performance than conventional methods such as B-spline interpolation and low-rank total variation (LRTV) [6]. Chaudhari et al. [7] proposed a 3D residual network, which learned residual transformations between paired LR and HR images for SR reconstruction of MRI. Chen et al. [8] proposed a densely connected super-resolution network (DCSRN), which reused block features through dense connections. Chen et al. [9] extended this work with a generative adversarial network (GAN) [10] to improve the realism of the recovered images. Feng et al. [11, 12] proposed a multi-contrast MRI SR method, which aims to learn clearer anatomical structure and edge information with the help of an auxiliary contrast. Despite significant progress, there is still room for improvement. Most networks require a large amount of paired LR and HR MR images for training, which is unrealistic in clinical practice. To address the challenge of collecting paired images, methods based on unpaired images have been proposed [13, 14]. However, HR MR images remain difficult to obtain, as acquiring them in clinical settings requires a significant amount of time. In contrast, CT images are routinely acquired in clinical practice. It is therefore of great significance to use HR CT images as guidance to synthesize HR MR images from LR MR images.

To this end, we propose a CT-guided, unsupervised MRI super-resolution reconstruction method based on joint cross-modality image translation (CIT) and super-resolution reconstruction, eliminating the requirement of HR MR images for training. Specifically, our network design features a super-resolution network (SRNet) and a cross-modality image translation network (CITNet) based on disentangled representation learning. After pretraining, the SRNet can generate pseudo HR MR images from LR MR images. The generated pseudo HR MR images are then taken together with the HR CT images as the input to the CITNet, which generates quality-improved pseudo HR MR images by combining the disentangled content code of the input CT data with the attribute code of the input pseudo HR MR images. Joint optimization of the CITNet and the SRNet leads to progressively better pseudo HR MR image generation. Upon convergence, we can use the SRNet to generate high-quality pseudo HR MR images from given LR MR images. The contributions of our work can be summarized as follows:

Fig. 1. A schematic illustration of our CT-guided, unsupervised MRI super-resolution reconstruction method. (A) Network architecture, including a SRNet and a CITNet; (B) pretraining of the SRNet; and (C) joint optimization of the CITNet and the SRNet. Different colors represent different domains: orange represents the MR domain, green represents the CT domain, and white shows the shared content space. (Color figure online)

  • We propose a CT-guided, unsupervised MRI super-resolution reconstruction method based on joint cross-modality image translation and super-resolution reconstruction, eliminating the requirement of HR MRI for training. Our cross-modality image translation is based on disentangled representation learning.

  • Our network design features a SRNet and a CITNet. They work jointly to generate high-quality pseudo HR MR images from given LR MR images. Concretely, a better trained SRNet helps generate a better input to the CITNet. In turn, the CITNet, taking the SRNet-generated pseudo HR MR images and the HR CT images as input, provides better supervision for the SRNet training. Joint optimization of the CITNet and the SRNet ultimately leads to the generation of high-quality pseudo HR MR images.

  • We validate the proposed method on two datasets collected from two different clinical centers.

2 Methodology

Figure 1 presents a schematic illustration of our CT-guided, unsupervised MRI super-resolution reconstruction method. It features two networks: the SRNet and the CITNet (Fig. 1-(A)). Figure 1-(B) shows how to pretrain the SRNet, while Fig. 1-(C) presents how to conduct joint optimization. Below we first present the design of the SRNet and the CITNet, followed by a description of the training strategy.

2.1 Super-Resolution Network (SRNet)

We choose to use the residual dense network (RDN) as the SRNet. The RDN utilizes cascaded residual dense blocks (RDBs), powerful convolutional blocks that leverage residual and dense connections to fully aggregate hierarchical features. For further details on the structure of the RDN, please refer to the original paper [15]. Mathematically, we denote the SRNet as \(\mathcal {F}_{s} (\cdot ; \varTheta _{s})\) with trainable parameters \(\varTheta _{s}\).
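To make the structure concrete, below is a minimal PyTorch sketch of a single residual dense block; the layer count, channel width, and growth rate are illustrative placeholders rather than the exact configuration of the RDN used as our SRNet.

```python
# A minimal sketch of a residual dense block (RDB), the building unit of the RDN;
# hyper-parameters here are assumptions for illustration only.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
        # local feature fusion: 1x1 conv aggregating all densely connected features
        self.fusion = nn.Conv2d(channels + num_layers * growth, channels, kernel_size=1)

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # dense connections
        return x + self.fusion(torch.cat(features, dim=1))      # local residual learning
```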

2.2 Cross-Modality Image Translation Network (CITNet)

The CITNet is inspired by MUNIT [16]. As depicted in Fig. 1-(A.2), it comprises two content encoders \(\left\{ E_{\mathcal {X}}^{\mathcal {C}},E_{\mathcal {Y}}^{\mathcal {C}}\right\} \), two attribute encoders \(\left\{ E_{\mathcal {X}}^{\mathcal {A}},E_{\mathcal {Y}}^{\mathcal {A}}\right\} \), and two generators \(\left\{ G_{\mathcal {X}},G_{\mathcal {Y}}\right\} \). The encoders in each domain disentangle an input image into a domain-invariant content space \(\mathcal {C}\) and a domain-specific attribute space \(\mathcal {A}\), and the generators combine a content code with an attribute code to generate translated images in the target domain. For instance, when translating a CT image \(y_{H}\in {\mathcal {Y}}\) to an MR image \(x_{H}^{'}\in {\mathcal {X}}\), we first randomly sample from the prior distribution \(p(\mathcal {A}_{x}^{'}) \sim \mathcal {N}(0, \textbf{I})\) to obtain an MRI attribute code \(\mathcal {A}_{x}^{'}\), which is empirically set as an 8-dimensional vector. We then combine \(\mathcal {A}_{x}^{'}\) with the disentangled content code of the CT image \(\mathcal {C}_{y}=E_{\mathcal {Y}}^{\mathcal {C}}(y_{H})\) to generate the translated MR image \(x_{H}^{'}\in {\mathcal {X}}\) through the generator \(G_{\mathcal {X}}\). Similarly, we can get the translated CT image \(\tilde{y}_{H}^{'}\in {\mathcal {Y}}\) through the generator \(G_{\mathcal {Y}}(\mathcal {C}_{x},\mathcal {A}_{y}^{'})\), where \(\mathcal {C}_{x}=E_{\mathcal {X}}^{\mathcal {C}}(\mathcal {F}_{s}(x_{L};\varTheta _{s}))\) and \(\mathcal {A}_{y}^{'}\) is also sampled from the prior distribution \(p(\mathcal {A}_{y}^{'}) \sim \mathcal {N}(0, \textbf{I})\).
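As an illustration of these two translation paths, here is a minimal sketch; `enc_ct_content`, `enc_mr_content`, `gen_mr`, `gen_ct`, and `srnet` are hypothetical handles standing in for \(E_{\mathcal {Y}}^{\mathcal {C}}\), \(E_{\mathcal {X}}^{\mathcal {C}}\), \(G_{\mathcal {X}}\), \(G_{\mathcal {Y}}\), and \(\mathcal {F}_{s}\).

```python
# A minimal sketch of the CT -> MR and MR -> CT translation paths described above;
# all network handles are hypothetical placeholders.
import torch

def translate_ct_to_mr(y_hr, enc_ct_content, gen_mr, attr_dim=8):
    c_y = enc_ct_content(y_hr)                                      # content code of the CT image
    a_x = torch.randn(y_hr.size(0), attr_dim, device=y_hr.device)   # MR attribute code sampled from N(0, I)
    return gen_mr(c_y, a_x)                                         # translated pseudo MR image x'_H

def translate_mr_to_ct(x_lr, srnet, enc_mr_content, gen_ct, attr_dim=8):
    x_hr = srnet(x_lr)                                              # pseudo HR MR image from the SRNet
    c_x = enc_mr_content(x_hr)                                      # content code of the pseudo HR MR image
    a_y = torch.randn(x_lr.size(0), attr_dim, device=x_lr.device)   # CT attribute code sampled from N(0, I)
    return gen_ct(c_x, a_y)                                         # translated pseudo CT image y'_H
```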

Disentangled Representation Learning. Cross-modality image translation is based on disentangled representation learning, trained with self- and cross-cycle reconstruction losses. As shown in Fig. 1-(C.1, C.2), the self-reconstruction loss \(L_{\text {self}}\) is utilized to regularize the training when the content and attribute codes originate from the same domain, whereas the cross-cycle consistency loss \(L_{\text {cycle}}\) is used when the content and attribute codes come from different domains. The self-reconstruction and cross-cycle reconstruction losses are defined as follows:

$$\begin{aligned} L_{\text {self}}=\left\| G_{\mathcal {X}}\left( E_{\mathcal {X}}^{\mathcal {C}}(\tilde{x}_{H}), E_{\mathcal {X}}^{\mathcal {A}}(\tilde{x}_{H})\right) -\tilde{x}_{H}\right\| _{1}+\left\| G_{\mathcal {Y}}\left( E_{\mathcal {Y}}^{\mathcal {C}}(y_{H}), E_{\mathcal {Y}}^{\mathcal {A}}(y_{H})\right) -y_{H}\right\| _{1} \end{aligned}$$
(1)
$$\begin{aligned} L_{\text {cycle}}=\Vert G_{\mathcal {X}}(E_{\mathcal {Y}}^{\mathcal {C}}(\tilde{y}_{H}^{'}),E_{\mathcal {X}}^{\mathcal {A}}(\tilde{x}_{H}))-\tilde{x}_{H}\Vert _{1}+\Vert G_{\mathcal {Y}}(E_{\mathcal {X}}^{\mathcal {C}}(x_{H}^{'}),E_{\mathcal {Y}}^{\mathcal {A}}(y_{H}))-y_{H}\Vert _{1} \end{aligned}$$
(2)

where \(\tilde{x}_{H}=\mathcal {F}_{s}(x_{L};\varTheta _{s})\), \(x_{H}^{'}=G_{\mathcal {X}}(E_{\mathcal {Y}}^{\mathcal {C}}(y_H),\mathcal {A}_{x}^{'})\), and \(\tilde{y}_{H}^{'}=G_{\mathcal {Y}}(E_{\mathcal {X}}^{\mathcal {C}}(\tilde{x}_{H}),\mathcal {A}_{y}^{'})\). In addition, in the cross-cycle translation processes, we employ a latent reconstruction loss to maintain an invertible mapping between the image and the latent space:

$$\begin{aligned} L_{\text {latent}}=\Vert \hat{\mathcal {C}_{x}}-\mathcal {C}_{x}\Vert _{1}+\Vert \hat{\mathcal {C}_{y}}-\mathcal {C}_{y}\Vert _{1}+\Vert \hat{\mathcal {A}_{x}}-\mathcal {A}_{x}^{'}\Vert _{1}+\Vert \hat{\mathcal {A}_{y}}-\mathcal {A}_{y}^{'}\Vert _{1} \end{aligned}$$
(3)
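The following sketch assembles the three reconstruction losses of Eqs. (1)-(3). It assumes, as in MUNIT, that the hatted codes \(\hat{\mathcal {C}}_{x}\), \(\hat{\mathcal {C}}_{y}\), \(\hat{\mathcal {A}}_{x}\), \(\hat{\mathcal {A}}_{y}\) are obtained by re-encoding the translated images; the network handles are hypothetical placeholders.

```python
# A minimal sketch of the self-reconstruction, cross-cycle and latent reconstruction
# losses (Eqs. 1-3); re-encoding the translated images to recover the codes is an
# assumption borrowed from MUNIT.
import torch.nn.functional as F

def reconstruction_losses(x_tilde, y_hr, nets, a_x, a_y):
    E_xc, E_xa, E_yc, E_ya, G_x, G_y = nets          # content/attribute encoders and generators
    c_x, c_y = E_xc(x_tilde), E_yc(y_hr)

    # self-reconstruction (Eq. 1): content and attribute from the same domain
    l_self = F.l1_loss(G_x(c_x, E_xa(x_tilde)), x_tilde) + F.l1_loss(G_y(c_y, E_ya(y_hr)), y_hr)

    # translated images: CT content + MR attribute, and MR content + CT attribute
    x_prime = G_x(c_y, a_x)
    y_prime = G_y(c_x, a_y)

    # cross-cycle consistency (Eq. 2)
    l_cycle = F.l1_loss(G_x(E_yc(y_prime), E_xa(x_tilde)), x_tilde) \
            + F.l1_loss(G_y(E_xc(x_prime), E_ya(y_hr)), y_hr)

    # latent reconstruction (Eq. 3): re-encoded codes should match the originals
    l_latent = F.l1_loss(E_yc(y_prime), c_x) + F.l1_loss(E_xc(x_prime), c_y) \
             + F.l1_loss(E_xa(x_prime), a_x) + F.l1_loss(E_ya(y_prime), a_y)
    return l_self, l_cycle, l_latent
```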

We further use a pretrained VGG16 network, denoted as \(\phi ( \cdot )\), to extract high-level features for computing the perceptual loss [17]:

$$\begin{aligned} L_{\text {percep}}=\frac{1}{C H W}\left\| \phi (\tilde{y}_{H}^{'})-\phi (\tilde{x}_{H})\right\| _{2}^{2}+\frac{1}{C H W}\left\| \phi (x_{H}^{'})-\phi (y_{H})\right\| _{2}^{2} \end{aligned}$$
(4)

where C, H, and W denote the channel number, height, and width, respectively.
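A minimal sketch of this perceptual loss is given below; the choice of VGG16 layer (up to relu3_3), the channel replication for single-channel inputs, and the omission of ImageNet normalization are assumptions for illustration, not our exact configuration.

```python
# A minimal sketch of the VGG16-based perceptual loss of Eq. (4).
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.features = vgg16(pretrained=True).features[:16].eval()  # up to relu3_3 (an assumption)
        for p in self.features.parameters():
            p.requires_grad = False                                   # frozen feature extractor

    def forward(self, pred, target):
        # single-channel medical images are repeated to 3 channels for VGG
        pred, target = pred.repeat(1, 3, 1, 1), target.repeat(1, 3, 1, 1)
        # mean squared error over features, i.e. the 1/(CHW)-normalized squared L2 norm
        return F.mse_loss(self.features(pred), self.features(target))
```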

Adversarial Learning. As shown in Fig. 1-(A.2), we use a GAN [10] to better learn the translation between the MR and CT image domains. A GAN typically contains a generation network and a discrimination network. We use the discriminator \(D_{\mathcal {X}}\) to judge whether an image is from the MR image domain, and the discriminator \(D_{\mathcal {Y}}\) to judge whether an image is from the CT image domain. The auto-encoders try to generate images of the target domain to fool the discriminators, so that the distribution of the translated images matches that of the target images. The min-max game is trained with:

$$\begin{aligned} L_{\text {adv}}^{\mathcal {X}}=\mathbb {E}_{\tilde{x}_{H} \sim P_{\mathcal {X}}(\tilde{x}_{H})}\left[ \log D_{\mathcal {X}}(\tilde{x}_{H})\right] +\mathbb {E}_{y_{H} \sim P_{\mathcal {Y}}(y_{H})}\left[ \log (1-D_{\mathcal {X}}(x_{H}^{'}))\right] \end{aligned}$$
(5)
$$\begin{aligned} L_{\text {adv}}^{\mathcal {Y}}=\mathbb {E}_{y_{H} \sim P_{\mathcal {Y}}(y_{H})}\left[ \log D_{\mathcal {Y}}(y_H)\right] +\mathbb {E}_{\tilde{x}_{H} \sim P_{\mathcal {X}}(\tilde{x}_{H})}\left[ \log (1-D_{\mathcal {Y}}( \tilde{y}_{H}^{'}))\right] \end{aligned}$$
(6)
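The sketch below writes the adversarial terms of Eqs. (5)-(6) in the equivalent binary cross-entropy form commonly used in practice; the discriminator architectures and the exact GAN loss variant of our implementation are not prescribed here.

```python
# A minimal sketch of the adversarial losses; D_x and D_y are the domain discriminators,
# x_prime / y_prime are the translated images, x_tilde / y_hr are the (pseudo) real images.
import torch
import torch.nn.functional as F

def _bce(logits, is_real):
    target = torch.ones_like(logits) if is_real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

def discriminator_loss(D_x, D_y, x_tilde, y_hr, x_prime, y_prime):
    # discriminators maximize Eqs. (5)-(6): real images -> 1, translated images -> 0
    return (_bce(D_x(x_tilde), True) + _bce(D_x(x_prime.detach()), False)
            + _bce(D_y(y_hr), True) + _bce(D_y(y_prime.detach()), False))

def generator_adv_loss(D_x, D_y, x_prime, y_prime):
    # the translation networks try to fool both discriminators
    return _bce(D_x(x_prime), True) + _bce(D_y(y_prime), True)
```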

Joint Optimization. The SRNet and the CITNet are jointly optimized by minimizing the following loss function:

$$\begin{aligned} L_{\text {disentangle}}=\left( L_{\text {adv}}^{\mathcal {X}}+L_{\text {adv}}^{\mathcal {Y}}\right) +\lambda _{1}(L_{\text {self}}+L_{\text {cycle}})+\lambda _{2} L_{\text {latent}}+\lambda _{3} L_{\text {percep}} \end{aligned}$$
(7)

where \(\lambda _{1}\), \(\lambda _{2}\), and \(\lambda _{3}\) are parameters controlling the relative weights of different losses.
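For completeness, composing Eq. (7) with the weights reported later in the implementation details can be sketched as follows.

```python
# A minimal sketch of Eq. (7); the default weights are those stated in the
# implementation details (lambda_1 = 10, lambda_2 = lambda_3 = 1).
def disentangle_loss(l_adv_x, l_adv_y, l_self, l_cycle, l_latent, l_percep,
                     lambda_1=10.0, lambda_2=1.0, lambda_3=1.0):
    return (l_adv_x + l_adv_y) + lambda_1 * (l_self + l_cycle) \
         + lambda_2 * l_latent + lambda_3 * l_percep
```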

2.3 Training Strategy

Empirically, we found that training the network shown in Fig. 1-(A) end-to-end did not converge. We thus design the following three-stage training strategy.

Stage 1. Let \(\mathcal {D}(\cdot )\) denote the downsampling function. In this stage, we pretrain the SRNet using the HR CT images, as shown in Fig. 1-(B.1), for T iterations. At each iteration, we sample a batch of HR CT images \(y_{H}\) and downsample them to obtain the paired LR CT images \(y_{L}=\mathcal {D}(y_{H})\). The SRNet is trained with the paired LR-HR CT images by minimizing the L1 loss \(\Vert y_{H}-\mathcal {F}_{s}(\mathcal {D}(y_{H});\varTheta _{s})\Vert _{1}\). The aim of this stage is to let the SRNet learn the upsampling kernels.
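A minimal sketch of this stage is shown below; `ct_loader` and `downsample` are hypothetical stand-ins for the HR CT data loader and \(\mathcal {D}(\cdot )\).

```python
# A minimal sketch of the Stage-1 pretraining loop on synthetic LR-HR CT pairs.
import torch
import torch.nn.functional as F

def pretrain_stage1(srnet, ct_loader, downsample, iterations, lr=1e-4):
    opt = torch.optim.Adam(srnet.parameters(), lr=lr)
    for _, y_hr in zip(range(iterations), ct_loader):
        y_lr = downsample(y_hr)                      # paired LR CT image y_L = D(y_H)
        loss = F.l1_loss(srnet(y_lr), y_hr)          # L1 loss between reconstruction and HR CT
        opt.zero_grad(); loss.backward(); opt.step()
```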

Algorithm 1. The training procedure of our method.

Stage 2. As the SRNet is only pretrained with CT images in stage 1, we need to generalize the learned upsampling kernels to the MR image domain. We thus further pretrain the SRNet with pseudo MR images, as shown in Fig. 1-(B.2), for another T iterations. At each iteration, we first sample a batch of LR MR images \(x_{L}\) and feed them into the SRNet to get the pseudo HR MR images \(\tilde{x}_{H}=\mathcal {F}_{s}(x_{L};\varTheta _{s})\). We then downsample \(\tilde{x}_{H}\) to get the corresponding pseudo LR MR images \(\tilde{x}_{L}=\mathcal {D}(\tilde{x}_{H})\). The SRNet is trained with the paired pseudo LR-HR MR images by minimizing the L1 loss \(\Vert \mathcal {F}_{s}(x_{L};\varTheta _{s})-\mathcal {F}_{s}(\mathcal {D}(\mathcal {F}_{s}(x_{L};\varTheta _{s}));\varTheta _{s})\Vert _{1}\). The idea behind this pretraining strategy is that, since CT and MR images share common structural information, the model pretrained with CT images in stage 1 facilitates the super-resolution reconstruction of pseudo HR MR images in stage 2. On the other hand, the training in stage 2 helps the SRNet to learn MRI-specific domain information.
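A sketch of this stage is given below; treating the pseudo HR MR image as a fixed target (detached from the graph) is an assumption made for stability of the illustration.

```python
# A minimal sketch of the Stage-2 self-supervised loop on MR data; `mr_lr_loader`
# is a hypothetical loader of the acquired LR MR images.
import torch
import torch.nn.functional as F

def pretrain_stage2(srnet, mr_lr_loader, downsample, iterations, lr=1e-4):
    opt = torch.optim.Adam(srnet.parameters(), lr=lr)
    for _, x_lr in zip(range(iterations), mr_lr_loader):
        with torch.no_grad():
            x_hr_pseudo = srnet(x_lr)                # pseudo HR MR target (detaching is an assumption)
        x_lr_pseudo = downsample(x_hr_pseudo)        # pseudo LR MR image
        loss = F.l1_loss(srnet(x_lr_pseudo), x_hr_pseudo)
        opt.zero_grad(); loss.backward(); opt.step()
```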

Stage 3. The MR images generated by the model pretrained in the first two stages can be further improved. In stage 3, we conduct joint optimization of the SRNet and the CITNet, as shown in Fig. 1-(C), for another \(8 \times T\) iterations. At each iteration, we first train \(D_{\mathcal {X}}\) and \(D_{\mathcal {Y}}\) by maximizing \(\left( L_{\text {adv}}^{\mathcal {X}}+L_{\text {adv}}^{\mathcal {Y}}\right) \). We then train \(E_{\mathcal {X}}^{\mathcal {C}}\), \(E_{\mathcal {Y}}^{\mathcal {C}}\), \(E_{\mathcal {X}}^{\mathcal {A}}\), \(E_{\mathcal {Y}}^{\mathcal {A}}\), \(G_{\mathcal {X}}\), \(G_{\mathcal {Y}}\), and the SRNet by minimizing \(L_{\text {disentangle}}\) as defined in Eq. (7).
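One Stage-3 iteration can be sketched as follows; `cit_losses` is a hypothetical wrapper around the CITNet forward passes and the losses defined above, `opt_d` covers the discriminators, and `opt_g` covers the encoders, generators, and the SRNet.

```python
# A minimal sketch of one joint-optimization step; all handles are hypothetical placeholders.
def joint_step(srnet, cit_losses, x_lr, y_hr, opt_d, opt_g):
    # (1) update D_X, D_Y by maximizing (L_adv^X + L_adv^Y);
    #     the pseudo HR MR image is detached so only the discriminators are updated
    d_loss = cit_losses(srnet(x_lr).detach(), y_hr, part="discriminator")
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # (2) update the encoders, generators and the SRNet by minimizing L_disentangle (Eq. 7);
    #     gradients flow through the pseudo HR MR image back into the SRNet
    g_loss = cit_losses(srnet(x_lr), y_hr, part="generator")
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```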

The training procedure of our method is illustrated by Algorithm 1.

Implementation Details. To train the proposed network, each training sample consists of an unpaired LR MR image and an HR CT image. All images are normalized to the range \([-1.0, 1.0]\). Optimization is performed using Adam with a batch size of 1. The initial learning rate is set to 0.0001 and decreased by a factor of 5 every 2 epochs. We empirically set \(\lambda _{1}=10\), \(\lambda _{2}=\lambda _{3}=1\), and \(T=100,000\).
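A sketch of this setup is shown below; implementing the schedule as a StepLR with \(\gamma =0.2\) stepped once per iteration is an assumption for illustration.

```python
# A minimal sketch of the normalization and optimization setup described above.
import torch

def normalize(volume):
    # scale intensities to [-1.0, 1.0]
    v = volume.float()
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)
    return v * 2.0 - 1.0

def build_optimizer(params, steps_per_epoch):
    opt = torch.optim.Adam(params, lr=1e-4)  # Adam, initial learning rate 0.0001
    # learning rate divided by 5 every 2 epochs, assuming scheduler.step() is called per iteration
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2 * steps_per_epoch, gamma=0.2)
    return opt, sched
```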

Table 1. The mean and the standard deviation of each metric when the proposed method is compared with the state-of-the-art (SOTA) unsupervised [18,19,20] and supervised [15, 21] methods on both datasets. The p-values of paired t-tests between our method and each of the other methods are smaller than 0.0001 for all evaluation metrics.
Table 2. Results of ablation study on dataset from Site1.
Fig. 2. Visual comparison of different methods when evaluated on the dataset from Site1.

Fig. 3. Visual comparison of different methods when evaluated on the dataset from Site2.

Fig. 4. Examples of cross-modality image translation between MRI and CT using data from Site2.

3 Experiments

Dataset. We conduct experiments to evaluate the proposed method on two datasets acquired from two different clinical centers. The dataset from the HFR Cantonal Hospital, University of Fribourg (Site1) consists of 50 paired MR-CT volumes, which are divided into training (35 volumes), validation (5 volumes), and testing (10 volumes) sets. The HR MR images are acquired in the coronal plane, and the voxel spacing of both the HR CT and MR images is \(1.0 \times 1.0 \times 1.0\) \(mm^3\). We downsample along the coronal axis with a scale factor \(K=4\) to generate LR MR images with a voxel spacing of \(1.0 \times 1.0 \times (1.0 \times K)\) \(mm^3\). We shuffle the paired MR-CT volumes and use only the unpaired LR MRI and HR CT for training; the held-out HR MRI are then used to compute the reconstruction metrics. The dataset from the University Hospital of Bern (Site2) consists of 19 unpaired MR-CT volumes, which are divided into training (13 volumes) and testing (6 volumes) sets. The HR MR images are acquired in the coronal plane, and the voxel spacing of both the HR CT and MR images is \(1.0 \times 1.3 \times 1.3\) \(mm^3\). We downsample along the coronal axis by a scale factor \(K=4\) to generate LR MR images with a voxel spacing of \(1.0 \times 1.3 \times (1.3 \times K)\) \(mm^3\).

Experimental Results. We compare our method with the conventional bicubic interpolation algorithm, the state-of-the-art (SOTA) unsupervised SR methods TSCN [18], ZSSR [19], and SMORE [20], as well as the SOTA supervised methods RDN [15] and ReconResNet [21]. Well-established metrics, including the Peak Signal-to-Noise Ratio (PSNR) [22, 23], the Structural Similarity Index Measure (SSIM) [24], and the Learned Perceptual Image Patch Similarity (LPIPS) [25], are used to assess the performance of the different methods.
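A sketch of how these metrics could be computed for one reconstructed slice is shown below; the use of scikit-image for PSNR/SSIM and the lpips package (AlexNet backbone) for LPIPS is an assumption about tooling, not a statement of our exact evaluation code.

```python
# A minimal sketch of computing PSNR, SSIM and LPIPS for a single 2D slice.
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_slice(pred, gt, lpips_model=None):
    # pred, gt: 2D numpy arrays normalized to [0, 1]
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, data_range=1.0)
    lpips_model = lpips_model or lpips.LPIPS(net="alex")     # in practice, create this model once
    # LPIPS expects 3-channel tensors in [-1, 1]
    to_t = lambda a: torch.from_numpy(a).float()[None, None].repeat(1, 3, 1, 1) * 2 - 1
    return psnr, ssim, lpips_model(to_t(pred), to_t(gt)).item()
```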

Table 1 shows the mean and the standard deviation of the evaluation results of each method on both datasets. Figure 2 and Fig. 3 respectively show the super-resolution results on data from Site1 and Site2 with the scale factor set to \(K=4\), together with the corresponding LR and ground truth (GT) images. Both the qualitative and the quantitative results demonstrate that our method achieves better results than the other SOTA unsupervised SR methods and comparable performance to the supervised SR methods.

Our method is trained in two pretraining stages and one joint optimization stage. We thus conduct an ablation study on the dataset from Site1 to analyze the quality of the generated pseudo HR MR images at each stage. As shown in Table 2, the quality of the generated pseudo HR MR images improves from stage to stage, demonstrating the effectiveness of the training strategy.

4 Conclusion

In this paper, we proposed a CT-guided, unsupervised MRI super-resolution reconstruction method based on joint cross-modality image translation and super-resolution reconstruction, eliminating the requirement of HR MRI for training. We conducted experiments on two datasets respectively acquired from two different clinical centers to validate the effectiveness of the proposed method. Quantitatively and qualitatively, the proposed method achieved superior performance over the SOTA unsupervised SR methods.