1 Introduction

Dynamic Magnetic Resonance Imaging (MRI) is a non-invasive imaging technique for monitoring dynamic processes such as cardiac motion by acquiring data in a k-t space that contains both temporal and spatial information. However, the acquisition speed is limited by both physical and physiological constraints. It is well known that significant correlations exist in k-space and time in dynamic MRI. To increase the acquisition rate, most strategies acquire only part of the desired k-t measurements and then reconstruct the images by exploiting spatio-temporal redundancies within the data.

Inspired by traditional k-t methods from the area of compressed sensing [8, 9, 15] for accelerated dynamic MR imaging, here we propose a novel dynamic MR image reconstruction NEtwork with X-f Transform, termed k-t NEXT, which exploits the signal redundancies in both the x-f domain and the image domain. In particular, the proposed k-t NEXT formulates the reconstruction process in an iterative fashion, where each iteration consists of two sub-modules: an xf-CNN that learns to recover the true signals from aliased signals in the x-f domain, and a convolutional recurrent neural network (CRNN) that exploits spatio-temporal redundancies in the image domain. The dynamic reconstruction process thus alternates between x-f space and image space, which potentially enables the network to learn complementary features from both domains. Experiments were performed on highly undersampled short-axis cardiac cine MR scans, where we show that the proposed model outperforms the current state-of-the-art dynamic MR reconstruction methods.

1.1 Related Work

Over the years, a number of approaches have been proposed for the reconstruction of accelerated dynamic MR images. In general, these methods can be divided into three categories, based on exploiting correlations in k-space, in time, or in both k-space and time [15]. The first class of approaches exploits the correlations between k-space points at the same time frame and reconstructs each frame independently of the other time frames; examples include reduced field-of-view (FOV) [6] and parallel imaging methods [3]. The second group of strategies exploits redundancies in time, where the missing data at a given position can be interpolated or extrapolated from the measured data at other time points, such as keyhole imaging [7] and data sharing [18]. Relevant to our method, the third type of approach exploits correlations in both k-space and time. One example is the model-based k-t BLAST and k-t SENSE method [15], which takes advantage of a-priori information about the x-f support obtained from a training stage to remedy the aliasing artefacts in the acquisition stage. Building on that, k-t FOCUSS [8, 9] formulated the problem in a compressed sensing MRI framework, which enforced sparsity in the x-f domain for signal recovery. Similarly, a low rank and sparse reconstruction scheme (k-t SLR) [10] was proposed to exploit correlations between the temporal profiles of the voxels by introducing non-convex spectral norms and a spatio-temporal total variation norm. In more recent years, deep learning approaches have gained popularity for MR image reconstruction [2, 12, 13, 16]. Most approaches exploit information in a single frame (or static image), either in the image domain [4, 11, 14] or in the k-space domain [1, 5, 17], where each frame (or image) is reconstructed independently. To exploit temporal redundancies, Schlemper et al. [13] proposed a data sharing (DS) layer in an image space cascaded 3D convolutional network to utilise the similar information contained in neighbouring k-space samples. Qin et al. [12] also proposed a bidirectional CRNN model to exploit the temporal dependencies of dynamic sequences in the image domain. In contrast, our approach reconstructs the images in both the x-f and image domains, so that complementary information from the two domains can be exploited.

2 Methods

2.1 Problem Formulation

Consider a Cartesian k-space trajectory where \(k_x\) denotes the phase encoding direction, \(k_y\) denotes the readout direction, and \(\sigma (x,t)\) denotes the image domain content at position x and time t. The k-space measurement \(v(k,t)\) is then formulated as:

$$\begin{aligned} v(k,t)=\int \sigma (x,t) e^{-j2\pi kx}dx = \int \int \rho (x,f)e^{-j2\pi (kx+ft)}dx\,df, \end{aligned}$$
(1)

where \(\rho (x,f)\) is the 2D spectral signal in x-f domain. This can also be represented in a matrix form: \(\mathbf{{v}} = \mathcal {F} \mathbf{{\rho }}\), in which \(\mathbf {v}\) and \(\mathbf {\rho }\) stand for the stacked k-t space measurement vectors and x-f image respectively, and \(\mathcal {F}\) is the 2D Fourier transform along the x-f direction. From the perspective of compressed sensing, the problem can be formulated by exploiting the sparsity of the unknown signal:

$$\begin{aligned} \text {min} \ ||{\mathbf {\rho }}||_{1}, \quad s.t.\ ||\mathbf {v}-\mathcal {F} \mathbf {\rho }||_{2}\le \epsilon , \end{aligned}$$
(2)

where \(\epsilon \) denotes the noise level. In k-t FOCUSS [8, 9], the underdetermined inverse problem was solved via a sparse reconstruction algorithm called FOCUSS. The n-th estimate of the x-f signal, \(\rho ^{(n)}\), can then be expressed as a baseline signal \(\bar{\rho }\) plus an encoding of its residual:

$$\begin{aligned} \rho ^{(n)} = \bar{\rho } + \text {FOCUSS}(\rho ^{(n-1)}-\bar{\rho }, \rho ^{(n-1)}). \end{aligned}$$
(3)

Here the mathematical form of the FOCUSS algorithm is omitted for simplicity. For details, please refer to [8, 9].
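As a small numerical illustration of Eq. (1) and its matrix form, the sketch below builds a random x-f array, obtains the image sequence by a Fourier transform along f, and checks that a further transform along x gives the same k-t data as a single 2D transform. The array sizes, the variable names and the use of NumPy's unnormalised discrete FFT in place of the continuous integrals are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_f = 8, 6  # hypothetical sizes: spatial positions x, temporal frequencies f

# Random complex x-f signal rho(x, f); axis 0 is x, axis 1 is f.
rho = rng.standard_normal((n_x, n_f)) + 1j * rng.standard_normal((n_x, n_f))

# sigma(x, t): Fourier transform of rho along f (kernel e^{-j 2 pi f t}).
sigma = np.fft.fft(rho, axis=1)

# v(k, t): Fourier transform of sigma along x (kernel e^{-j 2 pi k x}).
v_from_sigma = np.fft.fft(sigma, axis=0)

# Equivalent single step v = F rho, with F the 2D transform over (x, f).
v_from_rho = np.fft.fft2(rho)

print(np.allclose(v_from_sigma, v_from_rho))
```

The two routes agree because the 2D transform is separable, which is what allows a reconstruction to move between k-t, image and x-f representations one axis at a time.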

2.2 k-t NEXT for Dynamic MRI Reconstruction

Motivated by k-t BLAST [15] and k-t FOCUSS [9], we propose a dynamic image reconstruction NEtwork with X-f Transform (k-t NEXT) to exploit the spatio-temporal correlations from both x-f space and image space. Specifically, k-t NEXT formulates the iterative reconstruction process in an unfolded cascading way, which has been shown to be a powerful technique in MR reconstruction [12, 13]. In each iteration, our proposed approach learns to reconstruct the true images by alternating between x-f and image spaces, so that the spatio-temporal redundancies can be jointly exploited from these two complementary domains. In particular, an xf-CNN is proposed for the recovery of signals in the x-f domain, inspired by the traditional k-t methods, and a variation of the CRNN-MRI [12] network is adopted for the subsequent image space reconstruction. We can compactly represent a single iteration of k-t NEXT as follows:

$$\begin{aligned} \rho ^{(n)}&= \text {DC}(\bar{\rho }_{rec}^{(n-1)}) + xf\text {-CNN}(\rho _{rec}^{(n-1)}-\bar{\rho }_{rec}^{(n-1)}), \end{aligned}$$
(4a)
$$\begin{aligned} \mathbf {\sigma }_{rec}^{(n)}&=\text {CRNN}(\mathcal {F}_f\rho ^{(n)}; \mathbf {v}^{(0)}), \quad \rho _{rec}^{(n)} = \mathcal {F}_f^H \mathbf {\sigma }_{rec}^{(n)}, \end{aligned}$$
(4b)

where \(\mathbf {\sigma }_{rec}^{(n)} \in \mathbb {C}^D\) denotes the complex-valued reconstructed image sequence at iteration n, and \(\sigma _{rec}^{(0)}=\sigma _u\) is the acquired zero-filled undersampled image sequence. Here \(D=D_xD_yT\), in which \(D_x\) and \(D_y\) are the width and height of a frame and T is the number of frames. \(\mathcal {F}_f\) denotes the Fourier transform along the f dimension, and \(\rho _{rec}^{(n)}\) is the x-f spectral signal transformed from \(\sigma _{rec}^{(n)}\), while \(\rho ^{(n)}\) stands for the intermediate reconstructed signal from xf-CNN. Also, \(\bar{\rho }_{rec}^{(n-1)}\) denotes the temporally averaged x-f signal (see Eq. (5)), DC stands for the data consistency layer [13], and \(\mathbf {v}^{(0)}\in \mathbb {C}^M\) (\(M \ll D\)) is the acquired raw data. An illustrative diagram of k-t NEXT is shown in Fig. 1, and its components are described in the following.
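The data flow of Eq. (4) can be sketched as a plain function that takes the learned modules as arguments. This is only a structural sketch: the function names are hypothetical, the two networks and the DC layer are replaced by identity placeholders, and a unitary FFT along the last axis is assumed for \(\mathcal {F}_f\).

```python
import numpy as np

def F_f(rho):
    """x-f -> image sequence: unitary Fourier transform along the last (f) axis."""
    return np.fft.fft(rho, axis=-1, norm="ortho")

def F_f_H(sigma):
    """Image sequence -> x-f: adjoint (inverse) of F_f along the last (t) axis."""
    return np.fft.ifft(sigma, axis=-1, norm="ortho")

def kt_next_iteration(rho_rec, rho_bar, v0, xf_cnn, crnn, dc):
    """One cascade of Eq. (4): x-f update (4a), then image-space update (4b)."""
    rho_n = dc(rho_bar) + xf_cnn(rho_rec - rho_bar)  # Eq. (4a)
    sigma_rec = crnn(F_f(rho_n), v0)                 # Eq. (4b), left part
    rho_rec_next = F_f_H(sigma_rec)                  # Eq. (4b), right part
    return rho_n, sigma_rec, rho_rec_next

# Identity placeholders standing in for the learned modules (assumption).
identity = lambda a: a
identity_crnn = lambda sigma, v0: sigma

rng = np.random.default_rng(0)
rho_rec = rng.standard_normal((4, 4, 6)) + 1j * rng.standard_normal((4, 4, 6))
# Hypothetical baseline: a temporally constant signal occupies only the f = 0 bin.
rho_bar = np.zeros_like(rho_rec)
rho_bar[..., 0] = rho_rec[..., 0]

rho_n, sigma_rec, rho_next = kt_next_iteration(
    rho_rec, rho_bar, v0=None, xf_cnn=identity, crnn=identity_crnn, dc=identity)
print(np.allclose(rho_next, rho_rec))
```

With identity placeholders the cascade returns its input unchanged, which is a convenient sanity check that the transforms between the two domains are mutually consistent before any learned module is plugged in.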

Fig. 1. The k-t NEXT reconstruction diagram. True signals can be recovered by iteratively updating the reconstruction in both (a) x-f and (b) image domains via learning the xf-CNN and CRNN jointly. For mathematical notation, please refer to Eq. (4).

xf-CNN Exploiting Spatio-Temporal Correlations in x-f Domain. Following the formulation in Eq. (3), here we propose to formulate the xf-CNN reconstruction as Eq. (4a), where instead of using model-based [15] or compressed sensing [9] algorithms to recover the true signals, we employ a stack of CNN layers to estimate the missing data based on other available points, typically within their vicinity in x-f space. In particular, here the x-f baseline signal \(\bar{\rho }_{rec}^{(n)}\) is a temporal average of a sequence, i.e.,

$$\begin{aligned} \bar{\rho }_{rec}^{(n)}=\mathcal {F}^H \left[ \sum _t \mathbf {v}^{(n)}./\text {max}(\mathbf {1}, \sum _{t} \delta (\mathbf {v}^{(n)}))\right] , \quad \delta (a)= {\left\{ \begin{array}{ll} 0&{} a=0\\ 1&{} a \ne 0 \end{array}\right. } \end{aligned}$$
(5)

in which \(\mathbf {v}^{(n)}\) is the k-space data obtained by Fourier transforming \(\sigma _{rec}^{(n)}\), and the ./ and max operations are performed element-wise. Thereby, xf-CNN learns to reconstruct the residual of each frame, which further exploits the signal sparsity.
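The bracketed term of Eq. (5) can be illustrated with a few lines of NumPy. This is a minimal sketch on toy data: the function names and the (T, ky, kx) array layout are assumptions, and only a spatial inverse FFT is applied afterwards as a simplified stand-in for \(\mathcal {F}^H\).

```python
import numpy as np

def temporal_average_kspace(v):
    """Bracketed term of Eq. (5): per-location mean over the frames that were sampled.

    v: complex k-t array of shape (T, ky, kx), with zeros at unsampled locations.
    """
    counts = np.sum(v != 0, axis=0)                   # sum_t delta(v)
    return np.sum(v, axis=0) / np.maximum(1, counts)  # element-wise ./ and max

def baseline_image(v):
    """Spatial inverse FFT of the averaged k-space (a time-invariant baseline frame)."""
    return np.fft.ifft2(temporal_average_kspace(v), norm="ortho")

# Toy example: 3 frames of a 1 x 4 k-space; each location is sampled in some frames only.
v = np.array([
    [[2.0, 0.0, 4.0, 0.0]],
    [[4.0, 0.0, 0.0, 0.0]],
    [[0.0, 6.0, 2.0, 0.0]],
], dtype=complex)

v_bar = temporal_average_kspace(v)
print(v_bar.real)  # [[3. 6. 3. 0.]]
```

The max with 1 only matters for locations that are never sampled (the last column above), where it avoids a division by zero and leaves the average at zero.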

The illustrative diagram of the x-f reconstruction is shown in Fig. 1(a). Specifically, we formulate the k-t to x-f transformation process as an x-f transform layer in the network. In detail, the x-f transform layer receives k-t space data as input. For iteration n, the acquired k-space data is first averaged along t to yield a temporal average (Eq. (5)), which is then subtracted from the data at each time frame. To ensure data fidelity for the baseline estimate, here we propose to incorporate a data consistency (DC) term for \(\bar{\rho }_{rec}^{(n-1)}\) at each frame separately. Then the subtracted data and the temporally averaged data are inverse Fourier transformed to image space to obtain a sequence of aliased images and a data-consistent temporally averaged sequence. Each frequency-encoding position is processed separately hereafter. The image columns from the aliased images or the baseline images are gathered and inverse Fourier transformed along t to yield an x-f image, corresponding to \(\rho _{rec}^{(n-1)}-\bar{\rho }_{rec}^{(n-1)}\) and DC(\(\bar{\rho }_{rec}^{(n-1)}\)) respectively, which are then fed as inputs to xf-CNN for x-f space reconstruction (Eq. (4a)). After the signal de-aliasing in the x-f domain, another Fourier transform along f is applied to transform the estimated x-f signal \(\rho ^{(n)}\) back to dynamic image space for the subsequent image space reconstruction (Eq. (4b)).
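The steps of the x-f transform layer described above can be sketched as follows. Several details are assumptions rather than the stated implementation: the function name `xf_transform`, the (T, ky, kx) layout, a hard (noiseless) data consistency that simply keeps the acquired samples, and a subtraction of the temporal average that is restricted to sampled locations.

```python
import numpy as np

def xf_transform(v, mask):
    """Map k-t data to x-f inputs for an x-f network (sketch of the layer in the text).

    v:    complex k-t data, shape (T, ky, kx), zeros where not sampled.
    mask: boolean sampling mask of the same shape.
    Returns (residual_xf, baseline_xf), both of shape (F, y, x) with F == T.
    """
    # Temporal average over sampled frames (bracketed term of Eq. (5)).
    counts = np.maximum(1, mask.sum(axis=0))
    v_bar = v.sum(axis=0) / counts

    # Baseline per frame with hard data consistency: keep acquired samples,
    # use the temporal average elsewhere (assumed noiseless DC).
    v_base = np.where(mask, v, v_bar[None])

    # Residual k-t data: subtract the average at sampled locations only (assumption).
    v_res = np.where(mask, v - v_bar[None], 0)

    def to_xf(a):
        # Inverse FFT over (ky, kx) to image space, then inverse FFT along t to x-f.
        img = np.fft.ifft2(a, axes=(-2, -1), norm="ortho")
        return np.fft.ifft(img, axis=0, norm="ortho")

    return to_xf(v_res), to_xf(v_base)

# Sanity check on fully sampled static data: 5 identical frames.
rng = np.random.default_rng(0)
frame = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
v_static = np.repeat(frame[None], 5, axis=0)
mask_full = np.ones(v_static.shape, dtype=bool)
res_xf, base_xf = xf_transform(v_static, mask_full)
print(np.allclose(res_xf, 0), np.allclose(base_xf[1:], 0))
```

For a static, fully sampled sequence the residual vanishes and the baseline occupies only the f = 0 bin, which illustrates why subtracting the temporal average leaves a sparser x-f signal for the network to recover.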

k-t NEXT Exploiting Spatio-Temporal Redundancies in Complementary Domains. Previous approaches [2] have shown that exploring cross-domain knowledge is beneficial for the MR reconstruction task. Inspired by this, with the aim of exploiting redundancies in complementary domains, here we propose to learn a dynamic MR reconstruction network in both x-f and image spaces jointly. In particular, we employ the CRNN model for image space reconstruction due to its effectiveness in exploiting temporal redundancies with a relatively small network capacity [12]. Thus, in each cascade, the proposed k-t NEXT consists of an xf-CNN and a CRNN block, both of which employ only 2D convolutions across spatial and temporal dimensions, in contrast to the 3D convolutions used in the baseline method [13]. This enables the network to be more efficient and effective in learning useful and complementary features in x-f, spatial and temporal space simultaneously.

Given the training data S with undersampled data as input and fully sampled data as target, i.e., \(({\mathbf{{\sigma }}_u},{\mathbf{{\sigma }}_t})\) in image space and \((\mathbf{{\rho }}_{u}, \rho _{t})\) in x-f space, the network is trained end-to-end by minimising the pixel-wise mean squared error (MSE) between the reconstructed data and the ground truth fully sampled data:

$$\begin{aligned} \mathcal {L}\left( \varvec{\theta } \right) \mathrm{{ = }}\frac{1}{n_S}\sum \limits {\left( \left\| {{\mathbf{{\sigma }}_t} - {\mathbf{{\sigma }}_{rec}^{(N)}}} \right\| _2^2+\left\| {{\mathbf{{\rho }}_t} - {\mathbf{{\rho }}^{(N)}}} \right\| _2^2\right) }, \end{aligned}$$
(6)

where \({\mathbf{{\sigma }}_{rec}^{(N)}}\) and \({\mathbf{{\rho }}^{(N)}}\) denote the predicted image and x-f array at iteration N, i.e., the final output in image domain and x-f domain respectively, \(\varvec{\theta }\) is the set of network parameters, and \({n_S}\) is the number of training samples.
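Eq. (6) can be written out directly for complex-valued arrays. The sketch below is an illustration under assumptions: the function name is hypothetical, the first array axis is taken to index the \(n_S\) training samples, and NumPy is used instead of an automatic differentiation framework.

```python
import numpy as np

def kt_next_loss(sigma_t, sigma_rec, rho_t, rho_rec):
    """Eq. (6): squared L2 error in image space plus x-f space, averaged over samples.

    All inputs are complex arrays of shape (n_S, ...), one entry per training sample.
    """
    n_s = sigma_t.shape[0]
    img_term = np.sum(np.abs(sigma_t - sigma_rec) ** 2)  # ||sigma_t - sigma_rec||_2^2
    xf_term = np.sum(np.abs(rho_t - rho_rec) ** 2)       # ||rho_t - rho||_2^2
    return (img_term + xf_term) / n_s

# Toy check: two samples of three pixels each, error only in image space.
sigma_t = np.ones((2, 3), dtype=complex)
sigma_rec = np.zeros((2, 3), dtype=complex)
rho_t = np.full((2, 3), 1 + 1j)
rho_rec = rho_t.copy()

loss = kt_next_loss(sigma_t, sigma_rec, rho_t, rho_rec)
print(loss)  # 3.0
```

Supervising both outputs means that the x-f branch receives a direct training signal in addition to the gradient that reaches it through the image-space branch.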

3 Experiments and Results

3.1 Dataset and Implementation Details

The dataset used in our experiments consists of 10 fully sampled complex-valued short-axis cardiac cine MRI scans. Each scan contains a single slice SSFP acquisition with 30 temporal frames. The raw data consists of 32-channel data with a sampling matrix of \(192 \times 190\), which was zero-filled to \(256 \times 256\), and the raw multi-coil data was then reconstructed to produce a single complex-valued image. In the experiments, images were transformed back to k-space to simulate a fully sampled single-coil acquisition. A shear grid k-t Cartesian sampling pattern with four central lines (see Fig. 3(b)) was employed to undersample the k-space data and generate the undersampled input image sequences. The undersampling rate is stated with respect to the matrix size of the data, which is \(192 \times 190\).
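The retrospective undersampling described above can be sketched as follows. The exact sampling pattern is not specified in the text, so the mask below is a hypothetical sheared lattice (one line of shift per frame) with a block of always-sampled central lines; the function names, the choice of axis for the sampled lines and the toy sizes are likewise assumptions.

```python
import numpy as np

def shear_grid_mask(n_frames, n_ky, accel, n_central=4):
    """Hypothetical sheared-lattice k-t mask over (t, ky); True marks an acquired line."""
    mask = np.zeros((n_frames, n_ky), dtype=bool)
    for t in range(n_frames):
        mask[t, (t % accel)::accel] = True  # lattice shifted by one line per frame
    c = n_ky // 2
    mask[:, c - n_central // 2 : c + n_central // 2] = True  # always-sampled centre
    return mask

def undersample(sigma, mask):
    """Simulate zero-filled undersampled frames from a fully sampled sequence.

    sigma: complex images, shape (T, y, x); mask: (T, ky), applied to every kx.
    """
    k = np.fft.fftshift(
        np.fft.fft2(np.fft.ifftshift(sigma, axes=(-2, -1)), norm="ortho"),
        axes=(-2, -1))
    k_u = k * mask[:, :, None]  # zero out the lines that were not acquired
    sigma_u = np.fft.fftshift(
        np.fft.ifft2(np.fft.ifftshift(k_u, axes=(-2, -1)), norm="ortho"),
        axes=(-2, -1))
    return sigma_u, k_u

mask = shear_grid_mask(n_frames=30, n_ky=32, accel=8)
rng = np.random.default_rng(0)
sigma = rng.standard_normal((30, 32, 16)) + 1j * rng.standard_normal((30, 32, 16))
sigma_u, k_u = undersample(sigma, mask)
print(mask.sum(axis=1).min(), mask.sum(axis=1).max())
```

Because the lattice is shifted from frame to frame, every line is visited within `accel` consecutive frames, which is what makes the temporal average of Eq. (5) well defined at all k-space locations.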

In the proposed k-t NEXT, xf-CNN is composed of 5 layers of 2D CNN with a residual connection from the baseline estimate. For the CRNN model, a variation of the architecture in [12] is employed, which consists of 4 layers of bidirectional CRNN, 1 layer of 2D CNN, a residual connection and a DC layer. We used dilated convolutions with kernel size \(3 \times 3\) and dilation factor (3, 3), and the number of cascades N was set to 4 for all comparison methods. For the detailed network architecture, please refer to the supplementary materials. The network was implemented in PyTorch. During training, the ADAM optimiser was employed with a learning rate of \(10^{-4}\). Data augmentation was performed on-the-fly, with random rotation, scaling, and elastic transformation. All evaluations were done via a 3-fold cross validation.
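To give a feel for what a \(3 \times 3\) kernel with dilation factor 3 covers, the following arithmetic sketch computes the extent of one dilated kernel and the receptive field of a stack of such layers along one axis. Which layers are actually dilated is not stated in the text, so the five-layer stack below is a purely hypothetical configuration, and stride 1 is assumed throughout.

```python
def effective_kernel(kernel, dilation):
    """Spatial extent covered by one dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1

def stacked_receptive_field(n_layers, kernel, dilation):
    """Receptive field of n stride-1 layers with identical kernel and dilation."""
    return 1 + n_layers * (effective_kernel(kernel, dilation) - 1)

print(effective_kernel(3, 3))            # 7
print(stacked_receptive_field(5, 3, 3))  # 31
```

Dilation therefore enlarges the context seen by each output point without adding parameters, since a dilated \(3 \times 3\) kernel still has nine weights per channel pair.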

Table 1. Comparison results of different methods on dynamic cardiac cine MRI at high undersampling rates of 9 and 12. Best results are indicated in bold.

Fig. 2. Comparison results on spatial and temporal dimensions with their error maps. A dynamic video is shown in the supplementary materials for better visualisation.

Fig. 3. Visualisation in x-f domain. (a) Ground Truth (b) k-t sampling pattern (c) \(9 \times \) undersampled data (d) Reconstructed x-f image (e) Error between (c) and (d).

3.2 Results

In the experiments, we compared our proposed approach (k-t NEXT) with different dynamic MR reconstruction methods, including the compressed sensing method k-t FOCUSS [9], the deep learning method CRNN-MRI [12], and DS+3DCNN [13], which incorporates data sharing (DS). To investigate the effectiveness of xf-CNN, an additional baseline approach is proposed which replaces all x-f reconstruction in k-t NEXT with a DS component, termed DS+CRNN. In the DS methods, we set the number of neighbouring frames as \(n_{adj} \in \{0,1,\ldots ,5\}\) as in [13]. Note that for a fair comparison with our k-t NEXT, we modified the baseline approaches DS+3DCNN and DS+CRNN to learn the residual of a temporally averaged frame as well. Quantitative comparison results of the different methods on dynamic cardiac data with undersampling rates 9 and 12 are presented in Table 1, which compares the network capacity per cascade, peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) and high frequency error norm (HFEN) [12]. Networks for different undersampling factors were trained separately in this case. It can be seen that our proposed k-t NEXT outperforms the other baseline methods by a large margin in terms of all these measures at both undersampling rates, with roughly the same level of network capacity. In particular, k-t NEXT performs better than its corresponding DS pair, which indicates the merits of exploiting correlations in x-f space and complementary domains.
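For reference, PSNR can be computed as in the sketch below. The exact normalisation used for Table 1 is not stated in the text, so taking the peak of the reference magnitude image as the signal maximum is an assumption, as is the function name.

```python
import numpy as np

def psnr(reference, reconstruction):
    """PSNR in dB between magnitude images, using the reference peak as signal maximum."""
    ref = np.abs(reference)
    rec = np.abs(reconstruction)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(ref.max() ** 2 / mse)

# Toy check: a uniform error of 0.1 on a unit-amplitude image gives MSE = 0.01.
reference = np.ones((4, 4), dtype=complex)
reconstruction = 0.9 * reference
value = psnr(reference, reconstruction)
print(round(value, 6))  # 20.0
```

Higher PSNR and SSIM indicate a better reconstruction, whereas a lower HFEN is better, so the three measures in Table 1 should be read with their respective directions in mind.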

Additionally, we compared the qualitative results on \(9 \times \) undersampled data in Fig. 2, which shows the reconstructed images along both spatial and temporal dimensions, as well as their corresponding error maps. It can be observed that our proposed model can faithfully recover the images with smaller errors, especially around dynamic regions, compared with the other baseline methods. In particular, k-t NEXT produced visually sharper images than the DS methods. This can be attributed to the fact that, in contrast to DS approaches, which fill in k-space data from neighbouring frames and therefore could possibly generate averaged and smooth images, k-t NEXT directly estimates the missing data in x-f space. A visualisation of the x-f reconstruction is also presented in Fig. 3, which displays the reconstructed x-f image and its error map in comparison to the input aliased data. It can be observed that the aliasing artefacts were largely removed and the undersampled data were recovered to approximate the ground truth signals.

4 Conclusion

In this paper, we have presented a novel deep learning based method, k-t NEXT (k-t NEtwork with X-f Transform), for highly undersampled dynamic MR image reconstruction. xf-CNN is proposed to exploit correlations in k-t space by reconstructing the true signals from aliased signals in the x-f domain. Based on that, k-t NEXT is proposed to learn to iteratively recover the images by alternating between the complementary x-f and image domains, where the networks from both domains are trained jointly. Experimental results have shown that the proposed k-t NEXT outperforms state-of-the-art dynamic MR reconstruction methods in terms of both quantitative and qualitative performance. In future work, we will extend the method to dynamic 3D applications.