
1 Introduction

Magnetic resonance imaging (MRI) is widely used for diagnosis and treatment monitoring, as it provides structural and physiological information related to disease progression. Diffusion MRI (dMRI) measures molecular diffusion in biological tissues and provides microscopic details of tissue architecture, since molecules interact with many different obstacles while diffusing throughout tissue [16]. However, dMRI requires repeated acquisitions with different diffusion directions. Echo-planar imaging (EPI), which rapidly encodes each imaging slice, is therefore commonly used for dMRI. However, single-shot (ss-) EPI is susceptible to severe susceptibility-induced geometric distortion and \(T_2\)- and \(T_2^{*}\)-induced voxel blurring. These artifacts worsen at higher in-plane resolutions, as the time required to traverse k-space grows approximately linearly with resolution.

Multi-shot (ms-) acquisition is an effective approach to mitigate EPI-related artifacts: it segments k-space into multiple portions acquired across multiple repetition times (TRs), reducing the effective echo spacing. However, potential shot-to-shot phase variations across the EPI shots can introduce additional artifacts. Recent algorithms, such as low-rank prior methods like low-rank modeling of local k-space neighborhoods (LORAKS) [7, 8, 14, 15, 17, 18] and multi-shot sensitivity-encoded diffusion data recovery using structured low-rank matrix completion (MUSSELS) [19], have successfully addressed this challenge by jointly reconstructing msEPI images with a low-rank constraint applied across the EPI shots.

In recent years, deep learning has emerged as a promising approach for image reconstruction, offering potential solutions to the challenges of existing techniques, including long reconstruction times, residual artifacts at high acceleration factors, and over-smoothing [6, 9, 10]. One notable development is model-based deep learning (MoDL), which combines an unrolled convolutional neural network (CNN) with a parallel imaging (PI) forward model to denoise and unalias undersampled data [1]. MoDL has also been applied to multi-shot diffusion-weighted EPI as MoDL-MUSSELS, effectively replacing MUSSELS and significantly reducing reconstruction times while achieving results comparable to state-of-the-art methods [2]. MoDL-MUSSELS includes CNN denoisers in both image- and k-space, as recent work has demonstrated that utilizing both domains improves performance on metrics such as PSNR and SSIM [6]. However, existing deep learning networks for dMRI have typically been trained in a supervised manner, which requires a large amount of ground-truth images that are not easily acquired with EPI.

In contrast, self-supervised learning [3, 24, 25] does not rely on external training data and can be used in denoising, reconstruction, quantitative mapping, and other applications. Recent advancements in zero-shot self-supervised learning (ZS-SSL) have demonstrated successful scan-specific network training without any external database [24]. This approach has shown comparable or superior results to supervised networks. However, in dMRI, where the same volume is repeatedly acquired while changing diffusion directions, ZS-SSL typically requires training separate networks for different directions, which can be impractical.

The virtual coil (VC) approach is a highly effective technique for enhancing the performance of parallel MRI [5], particularly for EPI with partial Fourier acquisition. VC generates virtual coils by incorporating conjugate-symmetric k-space signals from the actual coils. This provides supplementary information for missing k-space data points, which is especially useful when combined with partial Fourier acquisition. Conceptually, the use of VC ensures an image quality equivalent to or exceeding that of the image reconstructed without VC.
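
To illustrate the idea, below is a minimal NumPy sketch of virtual-coil augmentation under the conjugate-symmetry assumption; the function name and the k-space centering convention are our own, not taken from the released code.

```python
import numpy as np

def add_virtual_coils(kspace):
    """Append conjugate-symmetric virtual coils to multi-coil k-space.

    kspace: complex array of shape (n_coils, ny, nx), DC at (ny//2, nx//2).
    Returns an array of shape (2 * n_coils, ny, nx).
    """
    # Virtual-coil signal: b_vc(k) = conj(b(-k)).
    flipped = kspace[:, ::-1, ::-1]
    # On an even-sized grid the flip maps index i to N-1-i; roll by one so
    # that -k lands on the matching grid point (DC stays at N//2).
    flipped = np.roll(flipped, shift=(1, 1), axis=(1, 2))
    return np.concatenate([kspace, np.conj(flipped)], axis=0)
```

The sensitivity maps of the virtual coils are, correspondingly, the complex conjugates of the actual coil maps.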

In this study, we propose a novel msEPI reconstruction method called zero-MIRID (zero-shot self-supervised learning of Multi-shot Image Reconstruction for Improved Diffusion MRI). Our method jointly reconstructs msEPI data by incorporating zero-shot self-supervised learning-based image reconstruction. Our key contributions are as follows:

  • We jointly reconstruct multiple-shot images using self-supervised learning.

  • We train one network for all diffusion directions, which accelerates training and improves performance.

  • We use network denoisers in both k- and image-space and employ the VC [5] to improve the conditioning of the reconstruction.

  • In the in-vivo experiment, the proposed method demonstrates more robust images and better diffusion metrics than the state-of-the-art PI technique for dMRI.

  • To the best of our knowledge, this study proposes the first self-supervised learning reconstruction for dMRI.

Overall, our zero-MIRID method offers a promising approach to enhance msEPI reconstruction in dMRI, providing improved image quality and diffusion metrics through the integration of self-supervised learning techniques.

2 Method

2.1 PI Techniques for dMRI

For msEPI data, SENSE is commonly used for image reconstruction; it reconstructs each shot's data individually using the spatial variation of the coil sensitivity profiles. The \(m^{th}\) shot image in the \(d^{th}\) diffusion direction, \(x_{d,m}\), can be reconstructed as follows.

$$\begin{aligned} x_{d,m} = \underset{x_{d,m}}{\textrm{argmin}} {\left\| \mathbf {\mathcal {F}}_m\textbf{C}x_{d,m}-b_{d,m} \right\| }_2^2 \end{aligned}$$
(1)

where \(\mathcal {F}_{m}\) is the undersampled Fourier transform for the \(m^{th}\) shot, \(\textbf{C}\) is the coil sensitivity map, and \(b_{d,m}\) is the acquired k-space data of the \(d^{th}\) direction and \(m^{th}\) shot.
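
For illustration, below is a minimal sketch of per-shot SENSE that solves Eq. (1) by conjugate gradient on the normal equations; the function names and the FFT centering convention are our assumptions.

```python
import numpy as np

def sense_recon(b, C, mask, n_iter=10):
    """Per-shot SENSE: solve A^H A x = A^H b with A = M F C by conjugate gradient.

    b:    (nc, ny, nx) acquired k-space (zero-filled where unsampled)
    C:    (nc, ny, nx) coil sensitivity maps
    mask: (ny, nx) binary sampling mask for this shot
    """
    ax = (-2, -1)
    F = lambda im: np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(im, axes=ax), axes=ax), axes=ax)
    Fh = lambda ks: np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(ks, axes=ax), axes=ax), axes=ax)
    AH = lambda ks: np.sum(np.conj(C) * Fh(mask * ks), axis=0)           # A^H
    AHA = lambda im: np.sum(np.conj(C) * Fh(mask * F(C * im)), axis=0)   # A^H A

    x = np.zeros(b.shape[1:], dtype=complex)
    r = AH(b)                      # initial residual for x = 0
    p = r.copy()
    rs = np.vdot(r, r)
    for _ in range(n_iter):        # conjugate gradient iterations
        Ap = AHA(p)
        alpha = rs / np.vdot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = np.vdot(r, r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```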

On the other hand, MUSSELS and LORAKS jointly reconstruct multiple-shot images using the low-rank property among msEPI data. The images in the \(d^{th}\) diffusion direction can be reconstructed using LORAKS as follows.

$$\begin{aligned} x_d=\underset{x_d}{\textrm{argmin}}\sum _{m=0}^{M}{{\left\| \mathbf {\mathcal {F}}_m\textbf{C} x_{d,m}-b_{d,m} \right\| }_2^2}+\lambda \mathcal {J}(\mathbf {\mathcal {F}}x_d) \end{aligned}$$
(2)

where \(\mathcal {J}\) is the LORAKS regularization. In this work, we utilized S-LORAKS, which employs phase information and k-space symmetry [14, 15].
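
The effect of the low-rank prior can be illustrated with a simpler C-matrix (MUSSELS-style) sketch below: k-space patches from all shots are stacked into a structured matrix, which is projected onto a low-rank subspace. The S-LORAKS regularizer used in this work additionally exploits phase and conjugate-symmetry information, which this sketch omits; the window size and rank are illustrative choices of ours.

```python
import numpy as np

def lowrank_project(kspace_shots, w=5, rank=30):
    """One structured low-rank projection across EPI shots (C-matrix style).

    kspace_shots: (M, ny, nx) complex k-space, one entry per shot.
    """
    M, ny, nx = kspace_shots.shape
    rows, idx = [], []
    for y in range(ny - w + 1):           # build the block-Hankel matrix
        for x in range(nx - w + 1):
            rows.append(kspace_shots[:, y:y + w, x:x + w].ravel())
            idx.append((y, x))
    H = np.array(rows)                     # (#patches, M*w*w)
    U, S, Vh = np.linalg.svd(H, full_matrices=False)
    S[rank:] = 0                           # enforce the low-rank prior
    H = (U * S) @ Vh
    out = np.zeros_like(kspace_shots)      # write patches back, averaging overlaps
    cnt = np.zeros((ny, nx))
    for r, (y, x) in enumerate(idx):
        out[:, y:y + w, x:x + w] += H[r].reshape(M, w, w)
        cnt[y:y + w, x:x + w] += 1
    return out / cnt
```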

2.2 Network Design

Fig. 1. The proposed image reconstruction diagram of zero-MIRID. The virtual coil (VC) layer was used to efficiently reconstruct the data accelerated by partial Fourier. Network denoisers were used in both the k-space and image domains. The DC layer enforces consistency between the acquired data and the reconstructed images.

Figure 1 shows the proposed network diagram of zero-MIRID. The input of the network is \(A_m^Hb_d\), where \(A_m=\mathcal {F}_m\textbf{C}\). The network consists of two CNNs, one in k-space and one in image space. Virtual coils are added before and removed after each denoising CNN. The images in the \(d^{th}\) diffusion direction can be jointly reconstructed using zero-MIRID as follows.

$$\begin{aligned} \begin{aligned} x_d =&\underset{x_d}{\textrm{argmin}}\sum _{m=0}^{M}{{\left\| \mathbf {\mathcal {F}}_m\textbf{C}x_{d,m}-b_{d,m} \right\| }_2^2} \\&+ \lambda _1\left\| \mathcal {V}_C^H{N}_i\mathcal {V}_Cx_d\right\| _2^2+\lambda _2\left\| \mathcal {V}_C^H\mathcal {F}^H{N}_k\mathcal {F}\mathcal {V}_Cx_d\right\| _2^2 \end{aligned} \end{aligned}$$
(3)

where \(\mathcal {V}_C\) is the VC operator, and \(N_i\) and \(N_k\) are the denoising CNNs in image- and k-space, respectively. We define \(Nx=x-Dx\), where D is a CNN, and modify the alternating-minimization-based solution in [2] to obtain the solution of Eq. (3), as follows.

$$\begin{aligned} \begin{aligned} x_{n+1} =&\left( A^H A + \lambda _1 I + \lambda _2 I\right) ^{-1} (A^H b + \lambda _1 \eta _n + \lambda _2 \zeta _n) \\ \zeta _{n+1} =&\mathcal {V}^H_C \mathcal {F}^H D_k \mathcal {F} \mathcal {V}_C x_{n+1} \\ \eta _{n+1} =&\mathcal {V}^H_C D_i \mathcal {V}_C x_{n+1} \end{aligned} \end{aligned}$$
(4)

where n is the optimization step (iteration) number, \(\zeta \) and \(\eta \) are the network denoising terms in k- and image-space, respectively, and \(A=\mathcal {F}\textbf{C}\).
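
A minimal sketch of one unrolled iteration of Eq. (4) is given below, with the data-consistency step solved by conjugate gradient. The operators are passed in as callables, and the regularization weights are illustrative placeholders, not the trained values.

```python
import numpy as np

def zero_mirid_iter(x, AHb, AHA, Vc, VcH, F, Fh, D_i, D_k,
                    lam1=0.05, lam2=0.05, cg_iter=10):
    """One unrolled iteration of Eq. (4). AHA(v) = A^H A v; Vc/VcH add and
    remove virtual coils; F/Fh are the FFT pair; D_i/D_k are the image- and
    k-space denoising CNNs."""
    eta = VcH(D_i(Vc(x)))              # image-space denoising term (eta)
    zeta = VcH(Fh(D_k(F(Vc(x)))))      # k-space denoising term (zeta)
    # Data consistency: solve (A^H A + (lam1 + lam2) I) x = rhs by CG.
    rhs = AHb + lam1 * eta + lam2 * zeta
    op = lambda v: AHA(v) + (lam1 + lam2) * v
    x_new = np.zeros_like(x)
    r = rhs.copy()                     # residual for x_new = 0
    p = r.copy()
    rs = np.vdot(r, r)
    for _ in range(cg_iter):
        Ap = op(p)
        alpha = rs / np.vdot(p, Ap)
        x_new = x_new + alpha * p
        r = r - alpha * Ap
        rs_new = np.vdot(r, r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x_new
```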

2.3 Zero-Shot Self-supervised Learning

Fig. 2. The masks used for the training, validation, and inference phases. The sampling mask was split into three different masks.

As proposed in the recent ZS-SSL study [24], we split the sampling mask into three groups, as shown in Fig. 2, where \(g_{3}\) is the entire sampling mask and \(g_3 \supset g_2 \supset g_1\). In the training phase, \(g_{1}\) was used as the network input, while \(g_{2}\) was used to calculate the training loss. In the validation phase, \(g_{2}\) was used as the network input, while \(g_{3}\) was used to calculate the validation loss. In the inference phase, \(g_{3}\) was used as the network input. The loss in the \(d^{th}\) direction in the training phase can be described as follows.

$$\begin{aligned} \mathcal {L}(g_2 \cdot b_{d}, \, g_2 \cdot A f(g_1 \cdot b_{d}; \theta ) ) \end{aligned}$$
(5)

where \(\mathcal {L}\) is the loss function, f is the zero-MIRID reconstruction, and \(\theta \) denotes the trainable network parameters. Similarly, the loss in the \(d^{th}\) direction in the validation phase can be described as follows.

$$\begin{aligned} \mathcal {L}(g_3 \cdot b_{d}, \, g_3 \cdot A f(g_2 \cdot b_{d}; \theta ) ) \end{aligned}$$
(6)

In this study, we used the normalized root mean square error (NRMSE) and normalized mean absolute error (NMAE) as the loss functions.
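
For concreteness, a sketch of the nested mask split and the two loss functions follows; the split fractions mirror the \(g_{3}\):\(g_{2}\):\(g_{1}\) ratio reported in Sect. 2.4, and the function and variable names are ours.

```python
import numpy as np

def split_mask(g3, frac_g2=0.80, frac_g1=0.48, seed=0):
    """Split the acquired sampling mask g3 into nested masks g1 < g2 < g3
    by randomly retaining the stated fraction of g3's sampled points."""
    rng = np.random.default_rng(seed)
    pts3 = np.flatnonzero(g3)
    g2 = np.zeros_like(g3).ravel()
    g2[rng.choice(pts3, int(frac_g2 * pts3.size), replace=False)] = 1
    pts2 = np.flatnonzero(g2)          # g1 is drawn from g2, so g1 is nested
    g1 = np.zeros_like(g3).ravel()
    g1[rng.choice(pts2, int(frac_g1 * pts3.size), replace=False)] = 1
    return g1.reshape(g3.shape), g2.reshape(g3.shape)

def nrmse(ref, est):
    """Normalized root mean square error."""
    return np.linalg.norm(est - ref) / np.linalg.norm(ref)

def nmae(ref, est):
    """Normalized mean absolute error."""
    return np.sum(np.abs(est - ref)) / np.sum(np.abs(ref))
```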

2.4 Experiment Details

In-vivo experiments were conducted on a 3T Siemens Prisma system with a 32-channel head coil. For dMRI, we acquired diffusion-weighted data in 32 different directions using 2-shot EPI, with each shot accelerated 5-fold (R=5) and employing 75% partial Fourier, resulting in 15% coverage of k-space per shot (0.75/5) relative to a fully sampled readout. Imaging parameters were: field of view (FOV) = \(224\,\times \,224\,\times \,128\) mm\(^{3}\), voxel size = \(1\,\times \,1\,\times \,4\) mm\(^{3}\), TR = 3.5 s, and effective echo time (TE) = 59 ms.

SENSE and S-LORAKS reconstructions were performed in MATLAB R2022a on an Intel Xeon 6248R with 512 GB RAM. All neural networks were implemented in Python using the Keras library in TensorFlow 2.4.1; an NVIDIA Quadro RTX 8000 (48 GB RAM) was used to train, validate, and test the network. Each denoising CNN consists of 16 convolutional layers with 3\(\,\times \,\)3 filters and a feature depth of 46, resulting in a total of 583,114 trainable parameters. The DC layer takes ten conjugate-gradient steps, and the reconstruction block iterates ten times, at which the MoDL paper [1] demonstrated saturated performance. The Adam optimizer with a learning rate of \(10^{-3}\) was used for training, and leaky ReLU was used as the activation function. For every diffusion direction, one \(g_{2}\) and 50 instances of \(g_{1}\) were generated; the ratio of the number of k-space points in \(g_{3}\):\(g_{2}\):\(g_{1}\) was 1.00:0.80:0.48. We trained a single network over all 32 diffusion directions and used it to reconstruct every direction. For comparison, we also trained two separate networks that reconstruct each shot individually (zero-SIRID, single-shot image reconstruction). We used the FSL toolbox for diffusion analysis [13, 22, 23] and Bayesian Estimation of Diffusion Parameters Obtained using Sampling Techniques (BEDPOSTX) [4, 11, 12] to estimate multiple fiber orientations.
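
As an illustration of the denoiser architecture described above, here is a minimal Keras sketch of a residual denoising CNN. The layer count, filter size, feature depth, and activation follow the text; the exact architecture and parameter count are in the released code linked below.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_denoiser(n_layers=16, features=46, in_channels=2):
    """Denoising CNN D used as Nx = x - Dx: 3x3 convolutions with leaky-ReLU
    activations. Complex images are handled as two real channels (an
    assumption of this sketch)."""
    inp = layers.Input(shape=(None, None, in_channels))
    h = inp
    for _ in range(n_layers - 1):
        h = layers.Conv2D(features, 3, padding='same')(h)
        h = layers.LeakyReLU()(h)
    out = layers.Conv2D(in_channels, 3, padding='same')(h)  # D(x)
    return tf.keras.Model(inp, out)
```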

Example data and code can be found at the following link:

https://github.com/jaejin-cho/miccai2023

Fig. 3. The reconstructed diffusion-weighted images at R=5 per shot for selected diffusion directions. Reference images were obtained from 5-shot EPI data with S-LORAKS reconstruction. SENSE and zero-SIRID reconstruct each shot individually, whereas S-LORAKS and zero-MIRID jointly reconstruct the two shot images. NRMSE is shown at the bottom of each image.

3 Results

Figure 3 shows the reconstructed diffusion-weighted images at 5-fold acceleration per shot in selected diffusion directions. The reference images were obtained from 5-shot EPI data covering complementary k-space lines, reconstructed with the S-LORAKS constraint. While SENSE shows severe noise amplification and residual folding artifacts, zero-SIRID partially mitigated the noise amplification. S-LORAKS jointly reconstructed the two shots, considerably reducing noise and improving the signal-to-noise ratio (SNR). Nonetheless, in the selected diffusion directions, folding artifacts were amplified and the center of the image shows a dropped signal (yellow arrows). In contrast, zero-MIRID demonstrated robust image reconstruction even at a high reduction factor per shot. The NRMSE and NMAE across diffusion directions are provided in the supplementary material, demonstrating notable reductions for the proposed method compared to S-LORAKS.

Fig. 4. Average DWI, FA map, and 2nd crossing-fiber image from the reconstructed images in Fig. 3. The number of 2nd crossing fibers is shown at the bottom of each column.

Figure 4 presents the average diffusion-weighted image (DWI), fractional anisotropy (FA) map, and 2nd crossing-fiber image calculated from the reconstructed images. S-LORAKS and zero-MIRID produced high-fidelity average DWIs, whereas SENSE and zero-SIRID show residual artifacts. SENSE, zero-SIRID, and S-LORAKS show amplified noise in the center of the FA maps, whereas zero-MIRID effectively mitigated this noise. Furthermore, zero-MIRID preserved the number of 2nd crossing fibers well, which is often considered a crucial factor in evaluating a successful dMRI acquisition [4, 12].

4 Discussion and Conclusion

In this study, we proposed an improved, self-supervised deep learning image reconstruction method for msEPI and dMRI. The in-vivo experiment demonstrates that the proposed method outperforms S-LORAKS, the state-of-the-art PI method for dMRI.

Acquiring reference images for msEPI can be challenging because each shot is typically highly accelerated and shot-to-shot phase variation prevents efficient joint reconstruction of multiple shots. Advanced PI techniques that jointly reconstruct many EPI shots can improve the PI conditioning and provide high-fidelity images, but using such reconstructions as references may bias the network toward that particular method. Therefore, supervised learning might not be an ideal solution for msEPI. On the other hand, self-supervised learning, which does not require reference images, could be a more suitable approach. Because reliable ground-truth data are difficult to obtain, conventional quantitative metrics such as SSIM and NRMSE may be less reliable for evaluation; in dMRI, FA maps and 2nd crossing fibers can serve as more suitable metrics.

We trained a single network for all diffusion directions, which improved performance and reduced training time (please see the supplementary material): NRMSE and NMAE were reduced from 14.69% to 13.61% and from 15.73% to 14.41%, respectively, and the training time fell from 40:01 min to 22:30 min per diffusion direction and slice (on GPU), which could be further shortened by transfer learning. Inference took approximately 1 s per direction/slice, whereas 2-shot LORAKS took approximately 20 s per direction/slice (on CPU). Since the images are highly similar across diffusion directions, training on all directions has an effect similar to enlarging the training database, thereby enhancing network training.

As future work, the simultaneous multi-slice (SMS) technique [21], which is often used for further acceleration, can be easily incorporated into the current network (please see the preliminary images in the supplementary material). At \(R_{sms}=5\times 2\)-fold acceleration, NRMSE and NMAE were significantly reduced compared with SENSE, from 22.91% to 9.07% and from 26.09% to 11.12%, respectively. g-Slider could be a good application as well [20], because RF-encoded images also share highly similar image features.