
1 Introduction

Cardiac magnetic resonance (CMR) stands as a vital clinical tool for assessing cardiovascular diseases due to its non-invasive and radiation-free nature, enabling a comprehensive evaluation of cardiovascular aspects, such as structure, function, flow, perfusion, viability, tissue characterization, as well as the assessment of myocardial fibrosis and other pathologies [1,2,3]. Key CMR applications include cine MR imaging and T1/T2 mapping.

However, CMR faces inherent physical challenges, primarily the time-consuming MRI acquisition process. The requirement for increased spatiotemporal resolution in cardiac imaging further amplifies this challenge. To mitigate prolonged scan times, MRI acquisitions are accelerated by acquiring undersampled k-space data, although this approach violates the Nyquist-Shannon sampling criterion [4].

In the broader MRI domain, conventional techniques such as Parallel Imaging (PI) [5, 6] and Compressed Sensing (CS) [7, 8] have been employed to accelerate MRI data acquisition. These approaches leverage spatial sensitivity information from multiple receiver coil arrays and exploit the sparsity or compressibility of MRI data. However, these methods have limitations, such as noise amplification in PI and sparsity assumptions in CS that may not hold for all MRI data, while finding optimal parameters for CS methods can be computationally expensive and time-consuming.

In the last decade, Deep Learning (DL) has revolutionized MRI image reconstruction, exhibiting superior performance compared to traditional methods, especially in accelerated MRI reconstruction tasks [9]. DL-based algorithms can learn complex image representations directly from available datasets, enabling enhanced image reconstruction from undersampled k-space measurements, often in supervised learning [10,11,12,13], or self-supervised settings [14]. This advancement holds significant potential to impact CMR by elevating the image quality of reconstructed highly undersampled data while concurrently reducing breath-hold duration.

In this work, motivated by the need for reducing acquisition times and breath-hold durations further during CMR, we employ vSHARP [15] (variable Splitting Half-quadratic ADMM algorithm for Reconstruction of inverse-Problems), a DL-based inverse problem solver, previously applied to brain and prostate MR imaging, where it exhibited state-of-the-art performance. We particularize vSHARP for accelerated Cardiac MRI Reconstruction and introduce in Sect. 3.1 two variants by treating the problem at hand as a 2D reconstruction task (2D model) or as a 2D dynamic reconstruction task (3D model). Additionally, in Sect. 3.2 we propose various training techniques to boost model training and generalizability across unseen cardiac (cine and T1/T2) MRI data. In Sect. 5, we experimentally compare our two approaches, highlighting that our 2D dynamic implementation outperforms traditional 2D reconstruction, and we further compare our models with current state-of-the-art approaches.

2 Theory and Problem Formulation

2.1 Accelerated MRI Reconstruction

Recovering a two-dimensional image \(\textbf{x}^{*}\in \mathbb {C}^{N}\) from undersampled multi-coil (assume \(N_c\) coils) k-space measurements \(\tilde{\textbf{y}}\in \mathbb {C}^{N\times N_{c}}\) can be formulated as a minimisation problem as follows:

$$\begin{aligned} \textbf{x}^{*} = \mathop {\textrm{argmin}}\limits _{\textbf{x}\in \mathbb {C}^{N}}\frac{1}{2} \sum _{k=1}^{N_c}\left| \left| \mathcal {A}^{k}(\textbf{x}) - \tilde{\textbf{y}}^{k}\right| \right| _2^2 + \mathcal {R}(\textbf{x}), \quad \mathcal {A}^{k} = \textbf{U} \mathcal {F} \textbf{S}^{k}, \end{aligned}$$
(1)

where \(\mathcal {A}^{k}\) represents the forward or corruption operator per coil. It involves mapping the image to an individual coil image using a known sensitivity map \(\textbf{S}^{k}\), transforming it to the k-space domain via the Fast Fourier Transform (FFT) \(\mathcal {F}\), and undersampling with \(\textbf{U}\). The function \(\mathcal {R}: \mathbb {C}^{N} \rightarrow \mathbb R\) denotes a regularization functional, which is assumed to impose prior knowledge about the image.
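As an illustration of the per-coil forward operator \(\mathcal {A}^{k} = \textbf{U} \mathcal {F} \textbf{S}^{k}\) in Eq. 1, the following NumPy sketch applies all \(N_c\) coils at once and also defines the adjoint operator. The function names (`fft2c`, `ifft2c`, `forward`, `adjoint`), the centered orthonormal FFT convention, and the array layout `(Nc, H, W)` are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def fft2c(x):
    """Centered, orthonormal 2D FFT over the last two axes."""
    x = np.fft.ifftshift(x, axes=(-2, -1))
    return np.fft.fftshift(np.fft.fft2(x, norm="ortho"), axes=(-2, -1))

def ifft2c(y):
    """Centered, orthonormal 2D inverse FFT over the last two axes."""
    y = np.fft.ifftshift(y, axes=(-2, -1))
    return np.fft.fftshift(np.fft.ifft2(y, norm="ortho"), axes=(-2, -1))

def forward(x, sens, mask):
    """A(x): image (H, W) -> undersampled multi-coil k-space (Nc, H, W)."""
    coil_images = sens * x            # S^k x for every coil k
    return mask * fft2c(coil_images)  # U F S^k x

def adjoint(y, sens, mask):
    """A^H(y): multi-coil k-space (Nc, H, W) -> single image (H, W)."""
    coil_images = ifft2c(mask * y)    # F^{-1} U y^k
    return np.sum(np.conj(sens) * coil_images, axis=0)
```

Applying `adjoint` to the measured k-space gives a zero-filled, coil-combined image, which is a common starting point for iterative reconstruction.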

In the context of cardiac magnetic resonance, acquisitions are typically dynamic and synchronized with electrocardiography (ECG)-derived cardiac cine. In dynamic acquisitions, multiple undersampled k-space data \(\tilde{\textbf{y}}\in \mathbb {C}^{N\times N_{c} \times N_{f}}\) are obtained at \(N_{f}\) time frames. Consequently, Eq. 1 is adapted as follows:

$$\begin{aligned} \textbf{x}^{*}_{\text {d}} = \mathop {\textrm{argmin}}\limits _{\textbf{x}\in \mathbb {C}^{N \times N_{f}}}\frac{1}{2} \sum _{t=1}^{N_f}\sum _{k=1}^{N_c}\left| \left| \mathcal {A}^{k}(\textbf{x}_{\cdot , t}) - \tilde{\textbf{y}}_{\cdot , t}^{k}\right| \right| _2^2 + \mathcal {R}(\textbf{x}), \quad \mathcal {A}^{k} = \textbf{U} \mathcal {F} \textbf{S}^{k}. \end{aligned}$$
(2)

In dynamic acquisitions, it is often assumed that knowledge can be shared across time frames or that the motion pattern is known, thereby requiring the selection of an appropriate prior \(\mathcal {R}: \mathbb {C}^{N \times N_{f} } \rightarrow \mathbb R\) that incorporates this information [16].

3 Methods

3.1 Deep Learning Framework

Sensitivity Map Prediction. In conventional settings, sensitivity maps are estimated from the autocalibration signal (ACS) data, often incorporating a portion of the center of the k-space. Advanced techniques for refining these estimated sensitivities include ESPIRiT or GRAPPA [5, 17]. However, these approaches can impose computational constraints. To overcome the need for such computationally expensive algorithms, we employ a two-dimensional deep learning module, specifically a 2D U-Net [18]. This model takes ACS-estimated sensitivity maps as input and produces refined versions of them as output. The predicted sensitivity maps \(\left\{ \textbf{S}^{k} \right\} _{k=1}^{N_c}\) are used for downstream reconstruction tasks, and the sensitivity module is trained in an end-to-end manner along with the reconstruction model.

Reconstruction via ADMM Unrolled Optimization. Our approach utilizes vSHARP [15], a DL-based inverse problem solver, to address Eq. 1. vSHARP employs the half-quadratic variable splitting method [19] to transform the optimization problem in Eq. 1 by introducing an intermediate variable \(\textbf{w}\). It then unrolls the optimization process over T iterations using the alternating direction method of multipliers algorithm (ADMM) [20], as follows:

$$\begin{aligned} \textbf{w}^{(j+1)} = \mathop {\textrm{argmin}}\limits _{\textbf{w}\in \mathbb {C}^{N}} \mathcal {R}(\textbf{w}) + \frac{\lambda }{2} \big | \big | \textbf{x}^{(j)} - \textbf{w} + \frac{\textbf{m}^{(j)}}{\lambda } \big | \big |_2^2, \end{aligned}$$
(3a)
$$\begin{aligned} \textbf{x}^{(j+1)} = \mathop {\textrm{argmin}}\limits _{\textbf{x}\in \mathbb {C}^{N}} \frac{1}{2} \sum _{k=1}^{N_c}\left| \left| \mathcal {A}^{k}(\textbf{x}) - \tilde{\textbf{y}}^{k}\right| \right| _2^2 + \frac{\lambda }{2} \big | \big | \textbf{x} - \textbf{w}^{(j+1)} + \frac{\textbf{m}^{(j)}}{\lambda } \big | \big |_2^2, \end{aligned}$$
(3b)
$$\begin{aligned} \textbf{m}^{(j+1)} = \textbf{m}^{(j)} + \lambda (\textbf{x}^{(j+1)} - \textbf{w}^{(j+1)}), \quad j=0,\cdots , T-1. \end{aligned}$$
(3c)

Instead of manually selecting a prior functional \(\mathcal {R}\) in Eq. 3a, our method employs U-Nets that learn the solution of this step, namely the denoising step, directly from data. Next, data consistency is enforced by solving Eq. 3b via an unrolled (differentiable) gradient descent scheme. Our approach initializes \(\textbf{w}^{(0)}\) and \(\textbf{x}^{(0)}\) using a zero-filled reconstruction with \(\tilde{\textbf{y}}\) and the predicted coil sensitivity maps: \(\textbf{w}^{(0)} = \textbf{x}^{(0)} := \sum _{k=1}^{N_c}{\textbf{S}^{k}}^{*} \mathcal {F}^{-1} (\tilde{\textbf{y}}^{k}) \). Additionally, a learned initializer, adapted from [13], is used to determine an initialization for the Lagrange Multipliers \(\textbf{m}^{(0)}\). For dynamic reconstruction as in Eq. 2, Eq. 3 is replaced by:

$$\begin{aligned} \textbf{w}^{(j+1)} = \mathop {\textrm{argmin}}\limits _{\textbf{w}\in \mathbb {C}^{N\times N_f}} \mathcal {R}(\textbf{w}) + \frac{\lambda }{2} \big | \big | \textbf{x}^{(j)} - \textbf{w} + \frac{\textbf{m}^{(j)}}{\lambda } \big | \big |_2^2, \end{aligned}$$
(4a)
$$\begin{aligned} \textbf{x}^{(j+1)} = \mathop {\textrm{argmin}}\limits _{\textbf{x}\in \mathbb {C}^{N\times N_f}} \frac{1}{2} \sum _{t=1}^{N_f}\sum _{k=1}^{N_c}\left| \left| \mathcal {A}^{k}(\textbf{x}_{\cdot , t}) - \tilde{\textbf{y}}_{\cdot , t}^{k}\right| \right| _2^2 + \frac{\lambda }{2} \big | \big | \textbf{x} - \textbf{w}^{(j+1)} + \frac{\textbf{m}^{(j)}}{\lambda } \big | \big |_2^2, \end{aligned}$$
(4b)
$$\begin{aligned} \textbf{m}^{(j+1)} = \textbf{m}^{(j)} + \lambda (\textbf{x}^{(j+1)} - \textbf{w}^{(j+1)}), \quad j=0,\cdots , T-1. \end{aligned}$$
(4c)
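The structure of the unrolled iterations in Eqs. 3a to 3c can be sketched in NumPy as follows. This is a simplified illustration and not the vSHARP implementation: the `denoiser` argument is a hypothetical callable standing in for the learned U-Net of the denoising step, \(\textbf{m}^{(0)}\) is set to zero instead of using the learned initializer, and the fixed values of `lam` and `step` are assumptions chosen so that gradient descent is stable when the sensitivity maps are normalized. For the dynamic case (Eqs. 4a to 4c), the same loop would apply with an additional time axis on the arrays.

```python
import numpy as np

def A(x, sens, mask):
    """Forward operator: image (H, W) -> masked multi-coil k-space (Nc, H, W)."""
    return mask * np.fft.fft2(sens * x, norm="ortho")

def AH(y, sens, mask):
    """Adjoint operator: multi-coil k-space (Nc, H, W) -> image (H, W)."""
    return np.sum(np.conj(sens) * np.fft.ifft2(mask * y, norm="ortho"), axis=0)

def unrolled_admm(y, sens, mask, denoiser, T=16, n_gd=14, lam=1.0, step=0.5):
    """Unrolled ADMM with T outer steps and n_gd gradient steps per x-update."""
    x = AH(y, sens, mask)      # zero-filled initialization of x^(0)
    w = x.copy()               # w^(0) = x^(0)
    m = np.zeros_like(x)       # assumption: zero instead of a learned initializer
    for _ in range(T):
        # Eq. 3a: denoising step (placeholder for the learned U-Net)
        w = denoiser(x + m / lam)
        # Eq. 3b: data consistency via unrolled gradient descent
        for _ in range(n_gd):
            grad = AH(A(x, sens, mask) - y, sens, mask) + lam * (x - w) + m
            x = x - step * grad
        # Eq. 3c: Lagrange multiplier update
        m = m + lam * (x - w)
    return x
```

The defaults `T=16` and `n_gd=14` mirror the 2D setup described in Sect. 4.1; passing `T=10` and `n_gd=8` would mirror the dynamic setup of Sect. 4.2.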

3.2 Model Training Techniques

In this section, we outline the various additional techniques employed in our paper to enhance the performance of our models.

Joint Modality Training. We trained our DL-based approach jointly on all available data (see Sect. 4.3). This approach served a dual purpose. First, instead of training separate models for each modality, joint modality training utilized a larger dataset, promoting more effective learning and generalization. Second, by integrating cine and T1/T2-weighted MRI data, we aimed to harness the complementarity between these modalities, enabling the model to exploit shared features and correlations and potentially improving the reconstruction quality for both modalities.

Random k-space Cropping. To optimize computational efficiency during training, we utilized random cropping on the fully-sampled multi-coil k-space data. Since direct cropping of the k-space would be inappropriate, we first applied the inverse Fast Fourier Transform (FFT) to reconstruct it into fully-sampled multi-coil images. Subsequently, random cropping was performed on these reconstructed images, and the resulting cropped images were transformed back to the k-space domain (via FFT). The k-space data was then undersampled and used as input to our model. This approach not only offered computational benefits but also allowed our model to gain exposure to different parts of the reconstructed data, including background noise and the regions of interest, without compromising overall reconstruction quality as compared to using non-cropped data. Figure 1 illustrates examples of cropped images before the transformation back to the k-space domain. Note that for dynamic data, the same cropping process was applied to all time frames.

Fig. 1. Randomly cropped (in the image domain) examples of cine and T1/T2-weighted MRI images from the dataset. These images are then transformed to the k-space domain, followed by retrospective undersampling, and are subsequently utilized for training.
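The cropping procedure described above (inverse FFT, crop in the image domain, FFT back to k-space) can be sketched as follows. The function name `random_crop_kspace`, the centered orthonormal FFT convention, and the array layout with spatial axes last are assumptions for this example. Because the crop window is drawn once and applied through the leading axes, the same crop is used for every coil and time frame.

```python
import numpy as np

def fft2c(x):
    """Centered, orthonormal 2D FFT over the last two axes."""
    x = np.fft.ifftshift(x, axes=(-2, -1))
    return np.fft.fftshift(np.fft.fft2(x, norm="ortho"), axes=(-2, -1))

def ifft2c(y):
    """Centered, orthonormal 2D inverse FFT over the last two axes."""
    y = np.fft.ifftshift(y, axes=(-2, -1))
    return np.fft.fftshift(np.fft.ifft2(y, norm="ortho"), axes=(-2, -1))

def random_crop_kspace(kspace, crop_shape, rng):
    """Crop fully-sampled k-space of shape (..., H, W) via the image domain."""
    h, w = crop_shape
    H, W = kspace.shape[-2:]
    # Draw one crop window, shared by all coils and time frames.
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    images = ifft2c(kspace)                              # k-space -> images
    cropped = images[..., top:top + h, left:left + w]    # crop in image domain
    return fft2c(cropped)                                # images -> k-space
```

The returned k-space has spatial size `crop_shape` and can then be retrospectively undersampled.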

Multi-scheme Undersampling. Undersampling for the target (validation) data comprised Cartesian rectilinear equispaced undersampling masks, with 24 fully-sampled ACS (central) lines, and with acceleration factors of \(R=\) 4, 8 and 10. Inspired by previous work [21], which demonstrated enhanced model generalizability in reconstructing Cartesian rectilinear data, we employed a multi-scheme undersampling setup during training. Alongside the provided undersampling pattern, we used the following undersampling schemes: Equispaced and Random Cartesian rectilinear, Gaussian 2D Cartesian, and pseudo-Radial and pseudo-Spiral schemes. These undersampling schemes are visualized in Fig. 2. Note that for dynamic data, the same undersampling scheme was applied on all time frames.

Fig. 2. Undersampling schemes used during training.
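For illustration, a Cartesian rectilinear equispaced mask with a fully-sampled ACS region, as used for the target data, could be generated along the phase-encoding axis as in the sketch below. The function name `equispaced_mask` and the `offset` parameter are hypothetical, and the sketch does not settle whether the ACS lines are counted towards the nominal acceleration factor \(R\).

```python
import numpy as np

def equispaced_mask(num_cols, acceleration, num_acs=24, offset=0):
    """1D boolean mask: every `acceleration`-th line plus central ACS lines."""
    mask = np.zeros(num_cols, dtype=bool)
    mask[offset::acceleration] = True        # equispaced sampled lines
    start = (num_cols - num_acs) // 2
    mask[start:start + num_acs] = True       # fully-sampled central (ACS) lines
    return mask

# Example: R = 4 with 24 ACS lines on 204 phase-encoding lines.
mask = equispaced_mask(204, 4)
```

The 1D mask can be broadcast over the readout axis, the coils, and, for dynamic data, all time frames, so that the same scheme is applied to every frame.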

Dual Domain Loss. To train our models we designed a dual-domain loss:

$$\begin{aligned} \mathcal {L}_{\boldsymbol{\phi }} = \mathcal {L}_{\boldsymbol{\phi }}^{img} + \mathcal {L}_{\boldsymbol{\phi }}^{freq}, \end{aligned}$$
(5)

where \(\mathcal {L}_{\boldsymbol{\phi }}^{img}\) and \( \mathcal {L}_{\boldsymbol{\phi }}^{freq}\) represent losses computed in the image and frequency domain, respectively.

Image Domain Loss. The image domain loss, \(\mathcal {L}_{\boldsymbol{\phi }}^{img}\), is computed between the ground truth RSS image \(\textbf{x}\) and the magnitude of the model-predicted image \(\hat{\textbf{x}}_{\boldsymbol{\phi }}\). This loss comprises several components:

$$\begin{aligned} \mathcal {L}_{\boldsymbol{\phi }}^{img} = \lambda _{ \text {SSIM}}\mathcal {L}_\text {SSIM}\left( \textbf{x},\, \hat{\textbf{x}}_{\boldsymbol{\phi }}\right) + \lambda _{1} \mathcal {L}_{1} \left( \textbf{x},\, \hat{\textbf{x}}_{\boldsymbol{\phi }}\right) + \lambda _{\text {HFEN}_{1}}\mathcal {L}_{\text {HFEN}_{1}} \left( \textbf{x},\, \hat{\textbf{x}}_{\boldsymbol{\phi }}\right) \end{aligned}$$
(6)

which are defined as follows:

$$\begin{aligned} \begin{gathered} \mathcal {L}_{\text {SSIM}} (\textbf{u},\,\textbf{v}) = 1- {\text {SSIM}} (\textbf{u},\,\textbf{v}), \quad \mathcal {L}_{1} (\textbf{u},\,\textbf{v}) = \left| \left| \textbf{u} - \textbf{v} \right| \right| _1, \\ \text {and,} \quad \mathcal {L}_{\text {HFEN}_1}(\textbf{u},\, \textbf{v}) = \text {{HFEN}}_1(\textbf{u},\, \textbf{v}). \end{gathered} \end{aligned}$$
(7)

In Eq. 7, SSIM denotes the Structural Similarity Index Measure, computed over W windows, each of size \(7\times 7\) pixels extracted from images \(\textbf{u}\) and \(\textbf{v}\). It is defined as:

$$\begin{aligned} \text {SSIM}(\textbf{u},\,\textbf{v}) = \frac{1}{W}\sum _{i=1}^{W} \frac{(2\mu _{\textbf{u}_i}\mu _{\textbf{v}_i} + 0.01)(2\sigma _{\textbf{u}_i\textbf{v}_i} + 0.03)}{({\mu ^2_{\textbf{u}_i}} +{\mu ^2_{\textbf{v}_i}} + 0.01)({\sigma ^2_{\textbf{u}_i}} + {\sigma ^2_{\textbf{v}_i}} + 0.03)}. \end{aligned}$$
(8)

Here, \(\mu _{\textbf{u}_i}\), \(\mu _{\textbf{v}_i}\), \(\sigma _{\textbf{u}_i}\) and \(\sigma _{\textbf{v}_i}\) represent the means and standard deviations of each window, while \(\sigma _{\textbf{u}_i\textbf{v}_i}\) denotes the covariance between \(\textbf{u}_i\) and \(\textbf{v}_i\). HFEN\(_1\) represents the High-Frequency Error Norm, and is defined as follows:

$$\begin{aligned} \text {{HFEN}}_1(\textbf{u},\, \textbf{v})\, = \, \frac{|| \text {{G}}(\textbf{u}) - \text {{G}}(\textbf{v}) ||_1}{||\text {{G}}(\textbf{u})||_1}, \end{aligned}$$
(9)

where \(\text {G}\) denotes a \(15\times 15\) Laplacian of Gaussian filter with a standard deviation of 2.5.

SSIM and HFEN are computed per single 2D slice/time frame. For dynamic reconstruction experiments, we also incorporated \(\lambda _{\text {SSIM3D}} \mathcal {L}_{\text {SSIM3D}}\), which computes the SSIM metric for volumes using windows of voxel-size \(7\times 7 \times 7\).
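The HFEN\(_1\) term of Eq. 9 can be sketched for a single real-valued 2D slice as follows. The kernel construction (a zero-sum Laplacian of Gaussian on a \(15\times 15\) grid with standard deviation 2.5) and the "valid" boundary handling are assumptions made for this example; the paper's implementation may normalize the kernel or pad the image differently.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def log_kernel(size=15, sigma=2.5):
    """Zero-sum Laplacian of Gaussian kernel of shape (size, size)."""
    r = np.arange(size) - size // 2
    xx, yy = np.meshgrid(r, r)
    sq = xx ** 2 + yy ** 2
    g = np.exp(-sq / (2 * sigma ** 2))
    k = (sq - 2 * sigma ** 2) * g / (sigma ** 4 * g.sum())
    return k - k.mean()                      # enforce zero sum

def log_filter(img, kernel):
    """Filter a 2D image with the kernel, keeping only fully covered pixels."""
    windows = sliding_window_view(img, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def hfen_l1(u, v, size=15, sigma=2.5):
    """HFEN_1(u, v) = ||G(u) - G(v)||_1 / ||G(u)||_1."""
    k = log_kernel(size, sigma)
    gu, gv = log_filter(u, k), log_filter(v, k)
    return np.abs(gu - gv).sum() / np.abs(gu).sum()
```

Since the kernel is symmetric, correlation and convolution coincide here, so no kernel flipping is needed.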

Frequency Domain Loss. The frequency domain loss, \(\mathcal {L}_{\boldsymbol{\phi }}^{freq}\), was computed between the ground truth multi-coil k-space \(\textbf{y}\) and the k-space transformation of the model predicted image \(\hat{\textbf{y}}_{\boldsymbol{\phi }}\):

$$\begin{aligned} \mathcal {L}_{\boldsymbol{\phi }}^{freq} = \lambda _{\text {NMAE}}\mathcal {L}_{\text {NMAE}}\left( \textbf{y},\, \hat{\textbf{y}}_{\boldsymbol{\phi }}\right) , \text { where } \mathcal {L}_\text {NMAE} (\textbf{u},\, \textbf{v})\,= \, \frac{||\textbf{u}\,-\,\textbf{v}||_1}{||\textbf{u}||_1}. \end{aligned}$$
(10)
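A minimal NumPy sketch of the frequency domain loss in Eq. 10 is given below. It assumes that the k-space of the predicted image is obtained by applying the sensitivity maps followed by an orthonormal FFT, and that the \(\ell _1\) norm of complex data is taken over complex moduli; both are assumptions for this example, as are the function names. The default weight `lam_nmae=3.0` follows the value reported in Sect. 4.4.

```python
import numpy as np

def nmae(u, v):
    """Normalized mean absolute error: ||u - v||_1 / ||u||_1."""
    return np.abs(u - v).sum() / np.abs(u).sum()

def frequency_loss(y_true, x_pred, sens, lam_nmae=3.0):
    """Weighted NMAE between ground-truth and predicted multi-coil k-space."""
    # Assumed mapping from predicted image (H, W) to k-space (Nc, H, W).
    y_pred = np.fft.fft2(sens * x_pred, norm="ortho")
    return lam_nmae * nmae(y_true, y_pred)
```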

The weighting factors \(\lambda _{\text {SSIM}}\), \(\lambda _{\text {SSIM3D}}\), \(\lambda _{1}\), \(\lambda _{\text {HFEN}_{1}}\), \(\lambda _{\text {NMAE}} \ge 0\) are hyperparameters that determine the influence of each loss component in the overall optimization process.

4 Experimental Setup

We conducted two sets of experiments, addressing the reconstruction task from two perspectives: a 2D reconstruction problem and a 2D dynamic reconstruction problem involving spatial dimensions and time.

4.1 2D Reconstruction

In this setup, our goal was to solve Eq. 3. We utilized 2D U-Nets with four scales as denoisers, each featuring 32 filters in the initial scale. The optimization process involved 16 steps (T = 16). Data consistency in Eq. 3b was ensured through 14 gradient descent iterations. For the sensitivity model, we employed a 2D U-Net with four scales and 32 filters for the first scale. This configuration focused on reconstructing 2D images. The input consisted of undersampled multi-coil k-space data from single slices or frames, and the output comprised 2D images.

4.2 2D Dynamic Reconstruction

In this configuration, we approached the reconstruction challenge dynamically, utilizing the formulation presented in Eq. 4. Our model took as input a sequential series of time frames featuring 2D undersampled multi-coil k-space data. Our objective was to generate a corresponding sequential series of time-frame images as the output. In contrast to the previous setup, we employed 3D U-Nets, incorporating four scales and 32 filters in the initial scale. However, to accommodate GPU memory constraints, we limited the optimization steps to T = 10 and conducted 8 gradient descent iterations for data consistency. Similarly to the 2D reconstruction setup, for the sensitivity model we utilized a 2D U-Net with four scales and 32 filters in the initial scale.

4.3 Dataset

We conducted our experiments using the CMRxRecon dataset [22], containing 4D multi-coil Cine and multi-contrast k-space data acquired on a 3T MRI scanner with protocols outlined in [23]. The Cine MRI data included short-axis (SAX) and long-axis (LAX) views, while the multi-contrast data encompassed T1 and T2-weighted MRI data. For training, we had access to a total of 203 cine and 240 multi-contrast 4D volumes of fully-sampled k-spaces. The validation dataset comprised 111 cine and 118 multi-contrast 4D volumes of undersampled k-spaces at acceleration factors of 4, 8, and 10.

4.4 Training and Optimization Details

Our models were implemented and optimized using PyTorch [24]. The Deep Image Reconstruction Toolkit (DIRECT) [25] facilitated our pipeline tools. We employed Adam as the model parameter optimizer, with \( \epsilon =10^{-8}\) and \(\left( \beta _1,\beta _2\right) = \left( 0.9, 0.999 \right) \). Training was conducted on four NVIDIA A100 80GB GPUs with a batch size of 1 and 2 on each GPU, for dynamic and non-dynamic tasks, respectively.

For both experimental setups, the loss computation used these weighting parameters: \(\lambda _{\text {SSIM}} \, = \, \lambda _{1} \, = \, \lambda _{\text {HFEN}_{1}} \, = \, 1.0\), and \(\lambda _{\text {NMAE}} \, = \, 3.0\). For 2D dynamic reconstruction (Sect. 4.2), we employed both versions of the SSIM loss, computed per 2D slice and across the entire sequence, and we set \(\lambda _{\text {SSIM3D}} \, = \, 1.0\).

4.5 Comparisons

To evaluate our proposed methods, we compared them against two state-of-the-art 2D MRI reconstruction approaches: the Recurrent Variational Network (RecurrentVarNet) [13], the winning method in the MultiCoil MRI Reconstruction Challenge [10], and the End-to-end Variational Network (E2EVarNet), one of the top-performing solutions in the fastMRI challenge [12]. Both approaches were trained using the same settings and techniques as used for our proposed methods.

4.6 Evaluation Metrics

Metrics used for evaluation were the structural similarity index measure (SSIM), the normalized mean-squared-error (NMSE), and the peak signal-to-noise ratio (PSNR).
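For reference, NMSE and PSNR can be computed as in the sketch below. Conventions for these metrics vary, so the choices here (normalizing NMSE by the squared norm of the ground truth, and using the ground-truth maximum as the peak value for PSNR) are assumptions for this example and are not necessarily those used for the reported results.

```python
import numpy as np

def nmse(gt, pred):
    """Normalized mean squared error: ||gt - pred||_2^2 / ||gt||_2^2."""
    return np.sum(np.abs(gt - pred) ** 2) / np.sum(np.abs(gt) ** 2)

def psnr(gt, pred, max_val=None):
    """Peak signal-to-noise ratio in dB for real-valued magnitude images."""
    if max_val is None:
        max_val = gt.max()                   # assumed peak: ground-truth maximum
    mse = np.mean((gt - pred) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)
```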

5 Results

Table 1. Average evaluation metrics on the validation set for each modality.

Figure 3 presents sample reconstructions, and Table 1 reports the evaluation results on the validation dataset for both of our experimental setups. Additionally, we include results from the two methods employed for comparison: the RecurrentVarNet and the E2EVarNet. We observe that both the 2D reconstruction and the 2D dynamic reconstruction with vSHARP yielded superior results in terms of quantitative metrics, surpassing both the RecurrentVarNet and the E2EVarNet. Moreover, the 2D dynamic reconstruction setup outperforms the 2D reconstruction for both Cine and Multi-Contrast tasks.

Additionally, in Table 2, we present the time required for volume reconstruction in seconds for the two experimental setups detailed in this work. Table 2 shows that, overall, the 2D dynamic reconstruction is faster than the 2D reconstruction in both Cine and Multi-Contrast scenarios.

Fig. 3. Sample reconstructions from the 10\(\times \) undersampled validation set.

Table 2. Time for reconstruction per volume (in seconds).

6 Conclusion and Discussion

In this work, we applied the variable Splitting Half-quadratic ADMM algorithm for Reconstruction of inverse-Problems (vSHARP) network, a state-of-the-art DL-based method, to the task of reconstructing undersampled Cardiac MRI data. We adapted vSHARP under two settings: one that considers the reconstruction problem as a 2D reconstruction task, i.e., each image at a specific time frame is treated individually, and one that considers it as a dynamic task by operating on all time frame data within a given sequence.

Table 1 shows that both of our proposed methods outperformed the alternatives. In addition, as anticipated and demonstrated in other works [26], our empirical findings confirm that 2D dynamic reconstruction outperforms the traditional 2D reconstruction. This improved performance of the 2D dynamic model can be attributed to its ability to leverage shared information across data points within the same time sequence.

Another aspect worth considering is that, in our dynamic setup, we employed all time frames per slice as input. This introduced GPU memory limitations, thereby constraining the parameter count in the reconstruction model (3D vSHARP). However, by utilizing only a subset of the time sequence data (e.g., 2–3 adjacent time frames), it would be feasible to construct a larger model.

Furthermore, Table 2 shows that the 2D dynamic reconstruction setup requires less inference time. This can be attributed to the fact that the 2D reconstruction process involves loading individual slices or time frames into memory and subsequently performing a forward pass through the model. This leads to relatively longer reconstruction times, as evidenced by the higher values for both the Cine and Multi-Contrast datasets. Conversely, in the 2D dynamic reconstruction setup, sequences of data are loaded collectively and processed in a single forward pass through the 2D dynamic model, resulting in significantly reduced reconstruction times. This observation could indeed play a pivotal role in selecting an appropriate reconstruction model for real-time clinical scenarios.