1 Introduction

Medical images are essential in diagnosing and monitoring various diseases and patient conditions. Different imaging modalities, such as computed tomography (CT) and magnetic resonance imaging (MRI), and different parametric images, such as T1 and T2 MRI, have been developed to provide clinicians with a comprehensive understanding of patients from multiple perspectives [7]. However, in clinical practice, it is often difficult to obtain a complete set of multi-modality images for diagnosis and treatment due to various reasons, such as modality corruption, incorrect machine settings, allergies to specific contrast agents, and limited available time [5, 10]. Cross-modality medical image synthesis is therefore valuable: it allows clinicians to access complementary characteristics across modalities and facilitates real-world applications in radiology and radiation oncology [28, 32].

With the rise of deep learning, numerous studies have been dedicated to medical image synthesis [4, 7, 18]. Notably, approaches based on generative adversarial networks (GANs) [8] have garnered significant attention in this area due to their success in image generation and image-to-image translation [11, 33], and have been widely applied to cross-modality medical image synthesis [2, 10, 32]. Despite their efficacy, however, GANs are susceptible to mode collapse and unstable training, which can degrade model performance and reduce reliability in practice [1, 17]. Recently, denoising diffusion probabilistic models (DDPMs) [9, 24] have introduced a new scheme for high-quality generation, offering desirable properties such as better distribution coverage and more stable training compared to GAN-based counterparts. Benefiting from this better performance [6], diffusion-based models are increasingly regarded as more reliable, and researchers have recently made first attempts to employ them for medical image synthesis [12, 13, 14, 19].

Different from natural images, most medical images are volumetric. Previous studies employ 2D networks as backbones to synthesize slices of medical volumes, owing to their ease of training [18, 32], and then stack the 2D results to form 3D volumes. However, this approach induces volumetric inconsistency, particularly along the z-axis under the standard coordinate convention. Although training 3D models avoids this issue, it is challenging and often impractical: it requires a massive amount of volumetric data, and the higher dimensionality of the data incurs costly memory requirements [3, 16, 26]. In summary, balancing the trade-off between training feasibility and volumetric consistency remains an open question that requires further investigation.

In this paper, we propose Make-A-Volume, a diffusion-based pipeline for cross-modality 3D brain MRI synthesis. Inspired by recent works that factorize video generation into multiple stages [23, 31], we introduce a new paradigm for volumetric medical data synthesis that leverages 2D backbones to achieve high-fidelity cross-modality synthesis while mitigating volumetric inconsistency. Specifically, we employ a latent diffusion model (LDM) [20] as a slice-wise mapping that learns cross-modality translation in an image-to-image manner. Benefiting from the low-dimensional latent space of LDMs, the high memory requirements of training are mitigated. To enable 3D image synthesis and enhance volumetric smoothness among slices, we further insert and fine-tune a series of volumetric layers that upgrade the slice-wise model to a volume-wise model. In summary, our contributions are three-fold: (1) We introduce a generic paradigm for 3D image synthesis with 2D backbones, which mitigates both the volumetric inconsistency of 2D backbones and the training difficulty of 3D backbones. (2) We propose an efficient latent diffusion-based framework for high-fidelity cross-modality 3D medical image synthesis. (3) We collected a large-scale high-quality dataset of paired susceptibility weighted imaging (SWI) and magnetic resonance angiography (MRA) brain images. Experiments on this in-house dataset and a public T1-T2 brain MRI dataset show the volumetric consistency and superior quantitative results of our framework.

Fig. 1. Overview of our proposed two-stage Make-A-Volume framework. A latent diffusion model is used to predict the noise added to the image and synthesize independent slices from Gaussian noise. We insert volumetric layers and quickly fine-tune the model, which extends the slice-wise model to a volume-wise model and enables synthesizing volumetric data from Gaussian noise.

2 Method

2.1 Preliminaries of DDPMs

In the diffusion process, DDPMs produce a series of noisy samples \(\{x_0, x_1, ..., x_T\}\) by sequentially adding Gaussian noise to the data over a predefined number of timesteps T. Formally, given clean data samples following the real distribution \(x_0 \sim q(x)\), the diffusion process with variances \(\beta _1, ..., \beta _T\) can be written as

$$\begin{aligned} q(x_t | x_{t-1})&= \mathcal {N}(x_t; \sqrt{1 - \beta _t}x_{t-1}, \beta _t \textbf{I}). \end{aligned}$$
(1)

Using the reparameterization property of the forward process, the corrupted data \(x_t\) can be sampled directly from \(x_0\) in closed form:

$$\begin{aligned} q(x_t | x_0) = \mathcal {N}(x_t; \sqrt{\bar{\alpha }_t}x_0, (1 - \bar{\alpha }_t) \textbf{I}); \ \ x_t = \sqrt{\bar{\alpha }_t}x_0 + \sqrt{1 - \bar{\alpha }_t} \epsilon , \end{aligned}$$
(2)

where \(\alpha _t = 1 - \beta _t\), \(\bar{\alpha }_t = \prod _{s=1}^t \alpha _s\), and \(\epsilon \sim \mathcal {N}(0, 1)\) is the added noise.

In the reverse process, the model learns a Markov chain process to convert the Gaussian distribution into the real data distribution by predicting the parameterized Gaussian transition \(p(x_{t-1} | x_t)\) with the learned model \(\theta \):

$$\begin{aligned} p_{\theta }(x_{t-1}|x_t)&= \mathcal {N}(x_{t-1}; \mu _{\theta }(x_t, t), \sigma ^2_t\textbf{I}). \end{aligned}$$
(3)

During training, the model learns to predict the added noise \(\epsilon \) with a simple mean squared error (MSE) loss:

$$\begin{aligned} L (\theta ) = \mathbb {E}_{x_0 \sim q(x), \epsilon \sim \mathcal {N}(0, 1), t}\left[ \left\| \epsilon - \epsilon _\theta (\sqrt{\bar{\alpha }_t} x_0 + \sqrt{1-\bar{\alpha }_t}\epsilon , t) \right\| ^2\right] . \end{aligned}$$
(4)
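For concreteness, the following is a minimal PyTorch sketch of the closed-form noising of Eq. (2) and the noise-prediction objective of Eq. (4). The linear variance schedule and the `model` interface are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_T and cumulative products alpha_bar_t."""
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)           # \bar{alpha}_t
    return betas, alpha_bars

def ddpm_loss(model, x0, alpha_bars):
    """Eq. (4): MSE between the true and predicted noise at a random timestep.

    model(x_t, t) is a placeholder noise-prediction network; x0 is a batch of
    clean samples shaped (b, c, h, w)."""
    b = x0.shape[0]
    t = torch.randint(0, alpha_bars.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)                          # epsilon ~ N(0, I)
    ab = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps      # Eq. (2), closed form
    return F.mse_loss(model(x_t, t), eps)
```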

2.2 Slice-Wise Latent Diffusion Model

To improve the computational efficiency of DDPMs, which learn data in pixel space, Rombach et al. [20] propose training an autoencoder with a KL penalty or a vector quantization layer [15, 27] and introducing a diffusion model to learn the latent distribution. Given a calibrated source-modality image \(x_c\) and a target-modality image x, we leverage a slice-wise latent diffusion model to learn the cross-modality translation. With the pretrained encoder \(\mathcal {E}\), \(x_c\) and x are compressed into a spatially lower-dimensional latent space of reduced complexity, yielding \(z_c\) and z. The diffusion and denoising processes are then carried out in the latent space, and a U-Net [21] is trained to predict the noise. The noisy latent \(z_t\) is concatenated with \(z_c\) as the network input, and the network learns the parameterized Gaussian transition \(p_{\theta }(z_{t-1}|z_t, z_c) = \mathcal {N}(z_{t-1}; \mu _{\theta }(z_t, t, z_c), \sigma ^2_t \textbf{I})\). After learning the latent distribution, the slice-wise model can synthesize the target latent \(\hat{z}\) from Gaussian noise, given the source latent \(z_c\). Finally, the decoder \(\mathcal {D}\) restores the slice to the image space via \(\hat{x} = \mathcal {D}(\hat{z})\).
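To illustrate the conditioning mechanism, the sketch below encodes both modalities with the pretrained autoencoder and concatenates the source latent \(z_c\) with the noisy target latent along the channel dimension before the denoising U-Net. The `encoder` and `unet` objects and their signatures are placeholders, not the released code.

```python
import torch
import torch.nn.functional as F

def slicewise_ldm_loss(unet, encoder, x_c, x, alpha_bars):
    """Latent-space noise prediction conditioned on the source slice.

    encoder: pretrained KL autoencoder's encoder (assumed frozen here).
    unet: denoising U-Net taking a channel-wise concatenation of the noisy
    target latent and the source latent, plus the timestep."""
    with torch.no_grad():
        z_c, z0 = encoder(x_c), encoder(x)              # compress to latent space
    b = z0.shape[0]
    t = torch.randint(0, alpha_bars.shape[0], (b,), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alpha_bars.to(z0.device)[t].view(b, 1, 1, 1)
    z_t = ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps      # noisy target latent
    eps_pred = unet(torch.cat([z_t, z_c], dim=1), t)    # condition by concatenation
    return F.mse_loss(eps_pred, eps)
```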

2.3 From Slice-Wise Model to Volume-Wise Model

Figure 1 illustrates an overview of the Make-A-Volume framework. The first stage trains a latent diffusion model that learns the cross-modality translation in an image-to-image manner and synthesizes independent slices from Gaussian noise. Then, to extend the slice-wise model to a volume-wise model, we insert volumetric layers and quickly fine-tune the U-Net. As a result, the volume-wise model synthesizes volumetric data from Gaussian noise without slice-wise inconsistency.

In the slice-wise model, the distribution of the latent \(z \in \mathbb {R}^{b_s \times c\times h\times w}\) is learned by the U-Net, where \(b_s, c, h, w\) are the slice batch size, channel, height, and width dimensions, respectively, and no volume awareness is introduced to the network at this point. Since our goal is to synthesize volumetric data, assuming each volume consists of n slices, we can factorize the slice batch size as \(b_s = b_v n\), where \(b_v\) denotes the batch size of volumes. Volumetric layers are then injected to help the U-Net learn the latent feature \(f \in \mathbb {R}^{(b_v\times n) \times c\times h\times w}\) with volumetric consistency. The volumetric layers are basic 1D convolutional layers, and the \(i\)-th volumetric layer \(l^i_v\) takes in a feature f and outputs \(f^{\prime }\) as:

$$\begin{aligned}&f^{\prime } \leftarrow \text {Rearrange}(f, (b_v\times n)\ c\ h\ w \rightarrow (b_v\times h\times w)\ c\ n),\end{aligned}$$
(5)
$$\begin{aligned}&f^{\prime } \leftarrow l^i_v(f^{\prime }),\end{aligned}$$
(6)
$$\begin{aligned}&f^{\prime } \leftarrow \text {Rearrange}(f^{\prime }, (b_v\times h\times w)\ c\ n \rightarrow (b_v\times n)\ c\ h\ w). \end{aligned}$$
(7)

Here, the 1D conv layers, combined with the pretrained 2D conv layers, serve as pseudo-3D conv layers with little extra memory cost. We initialize the volumetric 1D convolution layers as identity functions for more stable training, and we empirically find the fine-tuning efficient. With the volume-aware network, the model learns from volume data \(\{x^i\}^n_{i=1}\), predicts \(\{\hat{z}^i\}^n_{i=1}\), and reconstructs \(\{{\hat{x}}^i\}^n_{i=1}\). For diffusion model training, in the first stage we randomly sample a timestep t for each slice. In the second stage, however, the U-Net with volumetric layers learns the relationship between different slices in one volume, so fixing t within each volume is necessary, and we encourage small t values to be sampled more frequently to ease training. In detail, we sample the timestep t with replacement from a multinomial distribution, where the pre-normalized weight (used for computing probabilities after normalization) for timestep t equals \(2T-t\), with T the total number of timesteps. In this way, we enable a seamless transition from the slice-wise model, which processes slices individually, to a volume-wise model with better volumetric consistency.
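The sketch below shows one way to implement the volumetric layer of Eqs. (5)-(7), using einops for the rearrangement and an identity-initialized 1D convolution, together with the biased timestep sampler described above. The kernel size and the exact placement of these layers inside the U-Net are assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn
from einops import rearrange

class VolumetricLayer(nn.Module):
    """Pseudo-3D layer: a 1D convolution along the slice axis, inserted after a
    pretrained 2D convolution and initialized as the identity so that fine-tuning
    starts exactly from the slice-wise model."""
    def __init__(self, channels, n_slices, kernel_size=3):
        super().__init__()
        self.n = n_slices
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        nn.init.dirac_(self.conv.weight)   # identity mapping at initialization
        nn.init.zeros_(self.conv.bias)

    def forward(self, f):                  # f: ((b_v * n), c, h, w)
        _, _, h, w = f.shape
        f = rearrange(f, '(b n) c h w -> (b h w) c n', n=self.n)     # Eq. (5)
        f = self.conv(f)                                             # Eq. (6)
        f = rearrange(f, '(b h w) c n -> (b n) c h w', h=h, w=w)     # Eq. (7)
        return f

def sample_volume_timesteps(b_v, T, device='cpu'):
    """Second-stage sampler: one shared timestep per volume, drawn with replacement
    from a multinomial with pre-normalized weight 2T - t, favouring small t."""
    weights = 2.0 * T - torch.arange(1, T + 1, dtype=torch.float, device=device)
    return torch.multinomial(weights, b_v, replacement=True) + 1     # t in [1, T]
```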

3 Experiments

Datasets. The experiments were conducted on two brain MRI datasets: an SWI-to-MRA (S2M) dataset and the RIRE [30] T1-to-T2 dataset. To facilitate SWI-to-MRA brain MRI synthesis applications, we collected a high-quality SWI-to-MRA dataset. It comprises paired SWI and MRA volumes of 111 patients acquired at Qilu Hospital of Shandong University on a single 3.0T MRI scanner (Siemens Verio). The SWI scans have a voxel spacing of \(0.3438\times 0.3438\times 0.8\) mm and the MRA scans have a voxel spacing of \(0.8984\times 0.8984\times 2.0\) mm. While most public brain MRI datasets lack high-quality detail along the z-axis and therefore reveal volumetric inconsistency poorly, this dataset, with its clearly visible blood vessels, is well suited to evaluating volumetric synthesis. We also evaluate our method on the public RIRE dataset [30], which includes T1- and T2-weighted MRI volumes; 17 volumes were used in the experiments.

Implementation Details. For the S2M dataset, we randomly select 91 paired volumes for training and 20 paired volumes for inference; for the RIRE T1-to-T2 dataset, 14 volumes are randomly selected for training and 3 volumes are used for inference. All volumes are resized to \(256\times 256\times 100\) for S2M and \(256\times 256\times 35\) for RIRE, where the last dimension is the z-axis, i.e., the number of slices per volume in the 2D image-to-image setting. Our method is built upon U-Net backbones. We use a pretrained KL autoencoder with a downsampling factor of \(f = 4\). We train our model on an NVIDIA A100 80 GB GPU.

Table 1. Quantitative comparison on S2M and RIRE datasets.

Quantitative Results. We compare our pipeline to several baseline methods. The 2D-based methods are (1) Pix2pix [11], a solid baseline for image-to-image translation, and (2) Palette [22], a diffusion-based method for 2D image translation; the 3D-based methods are (3) a 3D version of Pix2pix, created by replacing the 2D backbone of the naive Pix2pix approach with a 3D backbone, and (4) a 3D version of CycleGAN [33]. Naive 3D diffusion-based models are not included due to the lack of efficient 3D backbones and their poor sampling efficiency over timesteps. We report results in terms of mean absolute error (MAE), structural similarity index (SSIM) [29], and peak signal-to-noise ratio (PSNR).
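As a reference for how these metrics might be computed, here is a small sketch using NumPy and scikit-image. The paper does not specify whether metrics are computed per slice or per volume, nor the intensity range, so the per-volume computation and the `data_range` argument below are assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_volume(pred, target, data_range=1.0):
    """Compute MAE, SSIM, and PSNR between a synthesized volume and the ground truth.

    pred, target: 3D arrays with matching shapes and intensities in [0, data_range]."""
    pred = pred.astype(np.float64)
    target = target.astype(np.float64)
    mae = float(np.mean(np.abs(pred - target)))
    ssim = structural_similarity(target, pred, data_range=data_range)
    psnr = peak_signal_noise_ratio(target, pred, data_range=data_range)
    return {'MAE': mae, 'SSIM': ssim, 'PSNR': psnr}
```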

Table 1 presents a quantitative comparison of our method and the baseline approaches on the S2M and RIRE datasets. Our method achieves better performance than the baselines across the evaluation metrics. To accelerate the sampling of diffusion models, we use DDIM [25] with 200 steps and report the results accordingly. It is worth noting that among the baselines, the 3D version (Pix2pix 3D) outperforms the corresponding 2D version (Pix2pix) at the cost of additional memory usage. For Palette, we implemented the 2D version but were unable to stably produce high-quality slices, and failure cases dramatically affected the metric results. Nonetheless, we include this method because it clearly illustrates volumetric inconsistency.

Qualitative Results. Figure 2 presents a qualitative comparison of the different methods, showcasing two axial slices with clearly visible vessels. Our method synthesizes better images with more detail. The areas requiring special attention are highlighted with red arrows and red rectangles. It is worth noting that the synthesized axial slices depend not only on the source slice but also on volume-level knowledge. For instance, in S2M case 1, the target slice shows a clear vessel cross-section whose appearance depends on the shape of the vessels across the volume. In Fig. 3, we provide coronal and sagittal views. For methods that rely on 2D generation, we synthesize individual slices and concatenate them into volumes. The volumetric inconsistency is clearly visible when examining the coronal and sagittal views of these volumes. For instance, Palette synthesizes 2D slices unstably: some slices are of good quality while others are poor, and the resulting volumes suffer severely from volumetric inconsistency. While 2D baselines inherently introduce inconsistency in the coronal and sagittal views, 3D baselines also generate poorer results than ours, particularly with regard to blood vessels and ventricles.

Fig. 2. Qualitative comparison. We compare our method with baselines on two cases.

Fig. 3. Coronal view and sagittal view. To clearly indicate the volumetric consistency, we show a coronal view and a sagittal view of the synthesized volumes and the ground truth volumes.

Table 2. Quantitative results of the ablation study.

Ablation Analysis. We conduct an ablation study to show the effectiveness of volumetric fine-tuning. Table 2 presents the quantitative results, demonstrating that our approach improves performance beyond that of the slice-wise model without incurring significant extra training cost. Figure 4 illustrates that fine-tuning the volumetric layers helps to mitigate volumetric artifacts and produce clearer vessels, which is crucial for medical image synthesis.

Fig. 4. Ablation qualitative results with coronal view and sagittal view.

4 Conclusion

In this paper, we propose Make-A-Volume, a diffusion-based framework for cross-modality 3D medical image synthesis. Leveraging latent diffusion models, our method achieves high performance and can serve as a strong baseline for multiple cross-modality medical image synthesis tasks. More importantly, we introduce a generic paradigm for volumetric data synthesis that utilizes 2D backbones, and we demonstrate that fine-tuning volumetric layers helps the two-stage model capture 3D information and synthesize better images with volumetric consistency. We collected an in-house SWI-to-MRA dataset with clear blood vessels to evaluate volumetric data quality. Experimental results on two brain MRI datasets demonstrate that our model achieves superior performance over existing baselines. Generating coherent 3D and 4D data is still at an early stage in the diffusion model literature; we believe that leveraging slice-wise models and extending them to 3D/4D models can help achieve better volume synthesis with reasonable memory requirements. In the future, we will investigate more efficient approaches for higher-resolution volumetric data synthesis.