1 Introduction

Magnetic Resonance Imaging (MRI) has been widely used to examine almost any part of the body, since it can depict structures inside the human body non-invasively and produce high-contrast images. Notably, cardiac MRI (CMR), which assesses cardiac structure and function, plays a key role in evidence-based diagnostic and therapeutic pathways in cardiovascular disease [13], including the assessment of myocardial ischemia, cardiomyopathies, myocarditis, and congenital heart disease [14]. However, acquiring high-resolution CMR is time-consuming and costly because it is sensitive to changes in the cardiac cycle length and respiratory position [23], which limits its clinical applicability.

To address this issue, the single image super-resolution (SISR) technique, which aims at reconstructing a high-resolution (HR) image from a low-resolution (LR) one, holds great promise, as it requires no change to the hardware or scanning protocol. Most MRI SISR approaches [3, 21, 24] are based on deep learning methods [5, 16], which learn the LR-HR mapping from extensive LR-HR paired data. On the other hand, several previous studies [11, 31] adapt the self-similarity-based SISR algorithm [8], which needs no external HR data for training. However, straightforwardly applying the aforementioned methods to CMR video reconstruction is inappropriate, since the relationship among consecutive frames in a CMR video is not well considered. Therefore, we adopt the video super-resolution (VSR) technique, which properly leverages temporal information and has been applied in numerous works [7, 10, 22, 27, 30], to perform CMR video reconstruction.

Fig. 1.

We present an efficient post-processing method to facilitate the acquisition of high-quality cardiac MRI (CMR), which is conventionally time-consuming, costly, and sensitive to changes in the cardiac cycle length and respiratory position [23]. Specifically, we utilize domain knowledge and iteratively enhance low-resolution CMR with a neural network, which can reduce the scan time and cost without changing the hardware or scanning protocol.

In this work, we propose an end-to-end trainable network to address the CMR VSR problem. To properly exploit temporal information, we choose ConvLSTM [28], which has proven effective [6, 9], as our backbone. Moreover, we introduce domain knowledge (i.e., the cardiac phase), which has been shown to be important for measuring the stroke volume [15] and for disease diagnosis [29], to provide direct guidance about the temporal relationship within a cardiac cycle. Combined with the proposed phase fusion module, the model can better utilize temporal information. Last but not least, we devise the residual of residual learning, inspired by the iterative error feedback mechanism [2, 19], to guide the model to iteratively recover the lost details. Different from purely feed-forward approaches [10, 18, 22, 27, 30], our iterative learning strategy makes it easier for the model to represent the LR-HR mapping with fewer parameters.

We evaluate our model and multiple state-of-the-art baselines on two synthetic datasets established by mimicking the acquisition of MRI [4, 31] from two public datasets [1, 26]. It is worth noting that one of them is used entirely for external evaluation. To properly assess model performance, we introduce cardiac metrics based on PSNR and SSIM. The experimental results show that the proposed network outperforms existing methods even on the large-scale external dataset, which indicates that our model generalizes well. To the best of our knowledge, this work is the first to address the CMR VSR problem and provides a benchmark to facilitate development in this domain.

Fig. 2.

Model overview. The bidirectional ConvLSTM [28] utilizes the temporal information from forward and backward directions. The phase fusion module exploits the informative phase code to leverage the bidirectional features. With the residual of residual learning, the network recovers the results in a coarse-to-fine fashion. Auxiliary paths are adopted for stabilizing the training procedure.

2 Proposed Approach

Let \(I_{LR}^t\) \(\in \mathbb {R}^{H \times W}\) denote the t-th LR frame obtained by down-sampling the original HR frame \(I_{HR}^t\) \(\in \mathbb {R}^{rH \times rW}\) with the scale factor r. Given a sequence of LR frames denoted as {\(I_{LR}^t\)}, the proposed end-to-end trainable model aims to estimate the corresponding high-quality results {\(I_{SR}^t\)} that approximate the ground truth frames {\(I_{HR}^t\)}. Besides, \(\oplus \) refers to the element-wise addition.

2.1 Overall Architecture

Our proposed network is illustrated in Fig. 2. It consists of a feature extractor, a bidirectional ConvLSTM [28], a phase fusion module, and an up-sampler. The feature extractor (FE) first processes the frame \(I_{LR}^t\) to obtain the low-frequency feature \(L^t\). Subsequently, the bidirectional ConvLSTM [28], comprising a forward ConvLSTM (\(ConvLSTM_F\)) and a backward ConvLSTM (\(ConvLSTM_B\)), makes use of the low-frequency feature \(L^t\) to generate the high-frequency features \(H^t_F, H^t_B\). With the help of its memory mechanism, the bidirectional ConvLSTM can fully utilize the temporal relationship among consecutive frames in both directions. In addition, thanks to the cyclic nature of cardiac videos, we can update the memory cells in the bidirectional ConvLSTM in advance instead of starting from empty states. This is done by feeding the n consecutive frames preceding and following the input sequence {\(I^t_{LR}\)} to the network.
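The memory warm-up described above can be sketched as a simple sequence-padding step; the function name and list-based representation are illustrative, not from the released code:

```python
def warm_start_sequence(frames, n):
    """Exploit the cyclic cardiac video: prepend the last n frames and
    append the first n, so the forward and backward ConvLSTM memory
    cells are updated before the target clip is processed (Sect. 2.1).

    `frames` is a list of frames covering one cardiac cycle.
    """
    return frames[-n:] + frames + frames[:n]
```

The padded sequence is fed through the network, and only the outputs corresponding to the original clip are kept.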

Furthermore, to completely integrate the bidirectional features, the designed phase fusion module (PF) applies the cardiac knowledge of the \(2N+1\) successive frames from \(t-N\) to \(t+N\) in the form of the phase code \(P^{[t-N:t+N]}\), which can be formulated as \(H_P^t = PF(H^{[t-N:t+N]}_F, H^{[t-N:t+N]}_B, P^{[t-N:t+N]})\), where \(H_P^t\) represents the fused high-frequency feature. After that, the fused high-frequency feature \(H_P^t\), combined with the low-frequency feature \(L^t\) through the global skip connection, is up-scaled by the up-sampler (Up) into the super-resolved image \(I^t_{SR} = Up(H_P^t \oplus L^t)\). We further define the sub-network (\(Net_{sub}\)) as the combination of \(ConvLSTM_F, ConvLSTM_B\), and PF; its purpose is to recover the high-frequency residual \(H_P^t = Net_{sub}(L^t)\). Besides, we employ the deep supervision technique [17] to provide an additional gradient signal and stabilize the training process by adding two auxiliary paths, namely \(I^t_{SR, F} = Up(H_F^t \oplus L^t)\) and \(I^t_{SR, B} = Up(H_B^t \oplus L^t)\). Finally, we propose the residual of residual learning, which progressively restores the residual that has yet to be recovered at each refinement stage \(\omega \). To simplify the notation, \(\omega \) is omitted when it equals 0, e.g., \(L^t_F\) means the low-frequency feature of the t-th frame at the 0-th stage \(L^{t, 0}_F\).

Fig. 3.

Proposed components. (a) Phase code, formulated as a periodic function, contains domain knowledge (i.e., cardiac phase). (b) Phase fusion module can recognize the phase of the current sequence with the cardiac knowledge to thoroughly integrate the bidirectional features. (c) Residual of residual learning aims at directing the model to reconstruct the results in a coarse-to-fine manner.

2.2 Phase Fusion Module

The cardiac cycle is a cyclic sequence of events occurring as the heart beats, consisting of the systole and diastole processes. Identifying the end-systole (ES) and end-diastole (ED) in a cardiac cycle has proved critical in several applications, such as measuring the ejection fraction and stroke volume [15] and diagnosing disease [29]. Hence, we embed the physical meaning of the input frames into our model via an informative phase code, generated by projecting the cardiac cycle onto a periodic cosine function as depicted in Fig. 3a. Specifically, we map the systole and the diastole processes to half cosine periods separately:

$$\begin{aligned} P^t = {\left\{ \begin{array}{ll} Cos(\pi \times \frac{t-ED}{ES-ED}), &{} \text {if } \; \text {ED} < t \le \text {ES}\\ Cos(\pi \times (1+\frac{(t-ES)\%T}{T-(ES-ED)})), &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

where % denotes the modulo operation and T is the number of frames in a cardiac cycle.
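Eq. (1) can be computed directly per frame index; the following sketch assumes integer frame indices with \(ED < ES < T\):

```python
import numpy as np

def phase_code(t, ed, es, T):
    """Map frame index t to a phase value in [-1, 1] (Eq. 1).

    Systole (ED < t <= ES) covers the first half cosine period, going
    from 1 down to -1; diastole covers the remaining T - (ES - ED)
    frames of the cycle, rising back from -1 to 1.
    """
    if ed < t <= es:
        return np.cos(np.pi * (t - ed) / (es - ed))
    return np.cos(np.pi * (1 + ((t - es) % T) / (T - (es - ed))))
```

For example, with ED = 0, ES = 10, and T = 30, the code equals 1 at ED, -1 at ES, and 0 halfway through systole.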

The overview of the proposed phase fusion module is shown in Fig. 3b. The features from the bidirectional ConvLSTM, concatenated with the corresponding phase codes, are fed into the fusion module. With the help of the \(2N+1\) consecutive phase codes, it can link frames at the same phase position across different periods (inter-period). Besides, it can recognize whether the heart is relaxing or contracting, as the phase code is respectively increasing or decreasing (intra-period).
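The input assembly of the module can be sketched as follows. This is a minimal illustration of the concatenation step only; `fuse` stands in for the learned fusion convolutions, and the broadcast of the scalar phase code into a feature plane is our assumption about how the code is injected:

```python
import numpy as np

def phase_fusion(h_f, h_b, p, fuse):
    """Assemble the phase fusion module input (Sect. 2.2).

    h_f, h_b: lists of 2N+1 forward/backward feature maps of shape
    (C, H, W); p: list of 2N+1 scalar phase codes, each broadcast to
    a (1, H, W) plane. All channels are concatenated and passed to the
    learned fusion network `fuse`.
    """
    planes = []
    for f, b, code in zip(h_f, h_b, p):
        _, H, W = f.shape
        planes += [f, b, np.full((1, H, W), code)]
    return fuse(np.concatenate(planes, axis=0))
```

With N = 1 (three frames) and 4-channel features, the fused input has 3 × (4 + 4 + 1) = 27 channels.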

2.3 Residual of Residual Learning

In the computer vision field, the iterative error-correcting mechanism plays an essential role in several topics, such as reinforcement learning [19], scene reconstruction [20], and human pose estimation [2]. Inspired by this mechanism, we propose the residual of residual learning, which decomposes the reconstruction process into multiple stages, as shown in Fig. 3c. At each stage, the sub-network (\(Net_{sub}\)) in our model estimates the high-frequency residual based on the current low-frequency feature, and then the input low-frequency feature is updated for the next refinement stage. Let \(L^{t, 0}\) be the initial feature from the feature extractor (FE) and \(L^{t, \omega }\) denote the updated feature at iteration \(\omega \); the residual of residual learning for \(\varOmega \) stages can be described in the recursive format:

$$\begin{aligned} L^{t, \omega } = {\left\{ \begin{array}{ll} FE(I^t_{LR}), &{} \text {when }\omega = 0 \\ L^{t, \omega -1} \oplus Net_{sub}(L^{t, \omega -1}), &{} \text {if } 0 < \omega \le \varOmega \end{array}\right. } \end{aligned}$$
(2)

Then, the network generates the super-resolution result \(I^{t, \omega }_{SR}\) based on the current reconstructed feature \(L^{t, \omega }\), which can be written as:

$$\begin{aligned} I^{t, \omega }_{SR} = Up(L^{t, \omega } \oplus Net_{sub}(L^{t, \omega })) \end{aligned}$$
(3)

The model progressively restores the residual that has yet to be recovered at each refinement stage, hence the name residual of residual learning. Compared to other one-step approaches [10, 18, 22, 27, 30], the proposed mechanism breaks the ill-posed problem down into several easier sub-problems in a divide-and-conquer manner. Most notably, it can dynamically adjust the number of iterations depending on the problem difficulty without any additional parameters.
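The control flow of Eqs. (2) and (3) can be sketched as follows. Here `net_sub` and `upsample` are placeholders for the sub-network (ConvLSTMs plus phase fusion) and the up-sampler; any callables on arrays work, since only the iterative structure is illustrated:

```python
import numpy as np

def residual_of_residual(feat_init, net_sub, upsample, num_stages):
    """Iteratively refine the low-frequency feature (Eq. 2) and decode
    an SR estimate at every stage (Eq. 3).

    Note that the same residual Net_sub(L^{t,omega}) is used both to
    form the stage-omega output and to produce the next feature, so
    each stage only needs one sub-network pass.
    """
    L = feat_init                        # L^{t,0} = FE(I_LR^t)
    outputs = []
    for _ in range(num_stages + 1):      # stages omega = 0 .. Omega
        H = net_sub(L)                   # high-frequency residual
        outputs.append(upsample(L + H))  # I_SR^{t,omega} (Eq. 3)
        L = L + H                        # L^{t,omega+1} (Eq. 2)
    return outputs
```

Because the same `net_sub` weights are reused at every stage, increasing \(\varOmega \) adds computation but no parameters, matching the claim above.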

2.4 Loss Function

In this section, we elaborate on the mathematical formulation of our cost function. At each refinement stage \(\omega \), the super-resolved frames {\(I^{t, \omega }_{SR}\)} are supervised by the ground-truth HR video {\(I^t_{HR}\)}, which can be formulated as \(\mathcal {L}^\omega = \frac{1}{\tilde{T}}\sum _{t=1}^{\tilde{T}}\ \parallel I^{t, \omega }_{SR} - I^t_{HR} \parallel _1\), where \(\tilde{T}\) indicates the length of the video sequence fed into the network. We choose the L1 loss as the cost function since previous works have demonstrated that it provides better convergence than the widely used L2 loss [18, 32]. Besides, we apply the deep supervision technique as described in Sect. 2.1 by adding two auxiliary losses \(\mathcal {L}_F^\omega = \frac{1}{\tilde{T}}\sum _{t=1}^{\tilde{T}}\ \parallel I^{t, \omega }_{SR, F} - I^t_{HR} \parallel _1\) and \(\mathcal {L}_B^\omega = \frac{1}{\tilde{T}}\sum _{t=1}^{\tilde{T}}\ \parallel I^{t, \omega }_{SR, B} - I^t_{HR} \parallel _1\). Hence, the total loss function can be summarized as \(\mathcal {L} = \sum _{\omega =0}^{\varOmega } (\mathcal {L}^\omega + \mathcal {L}_F^\omega + \mathcal {L}_B^\omega )\), where \(\varOmega \) denotes the total number of refinement stages.
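The total objective can be sketched as below; the argument names and list-of-stages layout are illustrative, not from the released code:

```python
import numpy as np

def total_loss(sr_stages, sr_f_stages, sr_b_stages, hr_frames):
    """L1 training objective with deep supervision (Sect. 2.4).

    Each *_stages argument is a list over refinement stages
    omega = 0 .. Omega, where each entry is an array of shape
    (T, H, W); `hr_frames` has shape (T, H, W).
    """
    def l1(sr):
        # mean over frames of the per-frame L1 norm ||SR - HR||_1
        T = len(hr_frames)
        return sum(np.abs(sr[t] - hr_frames[t]).sum() for t in range(T)) / T

    loss = 0.0
    for sr, sr_f, sr_b in zip(sr_stages, sr_f_stages, sr_b_stages):
        loss += l1(sr) + l1(sr_f) + l1(sr_b)  # main + two auxiliary paths
    return loss
```
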

Table 1. Quantitative results. The red and blue indicate the best and the second-best performance, respectively. We adopt CardiacPSNR/CardiacSSIM to fairly assess the reconstruction quality of the heart region. It is worth noting that the large-scale DSB15SR dataset is entirely for external evaluation.

3 Experiment

3.1 Experimental Settings

Data Preparation. To the best of our knowledge, there is no publicly available CMR dataset for the VSR problem. Hence, we create two datasets, named ACDCSR and DSB15SR, based on public MRI datasets. One is the Automated Cardiac Diagnosis Challenge dataset [1], which contains four-dimensional MRI scans of 150 patients in total. The other is the large-scale Second Annual Data Science Bowl Challenge dataset [26], composed of 2D cine MRI videos containing 30 images across the cardiac cycle per sequence. We use its testing set, comprising 440 patients, for external assessment to verify the robustness and generalization of the algorithms. To accurately mimic the acquisition of LR MRI scans [4, 31], we project the HR MRI videos into the frequency domain by the Fourier transform and filter out the high-frequency information. After that, we apply the inverse Fourier transform to project the videos back into the spatial domain and further downsample them by bicubic interpolation with scale factors of 2, 3, and 4.
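The degradation pipeline can be sketched as follows. The centered rectangular k-space mask and the cubic spline resampling (as a stand-in for bicubic interpolation) are our assumptions; the exact filter shape and interpolation kernel in the paper may differ:

```python
import numpy as np
from scipy.ndimage import zoom

def simulate_lr(hr_frame, scale):
    """Mimic LR MRI acquisition: low-pass filter in k-space, then
    downsample in the spatial domain (Sect. 3.1).
    """
    H, W = hr_frame.shape
    k = np.fft.fftshift(np.fft.fft2(hr_frame))   # to k-space, DC centered
    mask = np.zeros_like(k)
    h, w = H // (2 * scale), W // (2 * scale)
    mask[H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = 1  # keep low freqs
    filtered = np.real(np.fft.ifft2(np.fft.ifftshift(k * mask)))
    return zoom(filtered, 1.0 / scale, order=3)  # cubic-spline downsampling
```

Applying the filter before downsampling, rather than plain decimation, is what distinguishes this MRI-aware degradation from the bicubic-only protocol common in natural-image SR.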

Evaluation Metrics. The PSNR and SSIM criteria have been widely used in previous studies to evaluate SR algorithms. However, the considerable disparity between the proportions of the cardiac region and the background region in MRI images makes the results heavily biased towards the insignificant background region. Therefore, we introduce CardiacPSNR and CardiacSSIM to assess the performance more impartially and objectively. Specifically, we employ a heart ROI detection method similar to [25] to crop the cardiac region and calculate PSNR and SSIM within this region. This reduces the influence of the background region and more accurately reflects the reconstruction quality of the heart region.
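CardiacPSNR then reduces to standard PSNR on the cropped region. The sketch below assumes the ROI bounding box has already been produced by a heart detector similar to [25], which is out of scope here:

```python
import numpy as np

def cardiac_psnr(sr, hr, roi, data_range=1.0):
    """PSNR restricted to a heart ROI (CardiacPSNR).

    `roi` is a (top, bottom, left, right) crop from a heart ROI
    detector; `data_range` is the intensity range of the images.
    """
    t, b, l, r = roi
    diff = sr[t:b, l:r] - hr[t:b, l:r]
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(data_range ** 2 / mse)
```

CardiacSSIM is computed analogously by evaluating SSIM on the same crop.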

Training Details. For training, we randomly crop LR clips of \(\tilde{T} = 7\) consecutive frames of size \(32\times 32\), together with the corresponding HR clips. We experimentally choose \(n = 6\) and \(\varOmega = 2\) as detailed in Sect. 3.3, while \(N = 2\) in the phase fusion module. We use the Adam optimizer [12] with a learning rate of \(10^{-4}\) and set the batch size to 16. For the other baselines, we generally follow their original settings, except for necessary modifications, and train them from scratch.

Fig. 4.

Experimental analysis. (a) Our network outperforms other baselines with fewer parameters and higher FPS. (b) The performance is progressively enhanced as n increases, which indicates that the prior sequence can provide useful information. (c) The performance can be improved with \(\varOmega \) increasing.

Table 2. Ablation study. Memory: the memory cells in the ConvLSTM [28] are activated; Updated memory: the memory cells are updated by feeding n consecutive frames; Bidirection: bidirectional ConvLSTM is adopted; Phase fusion module and Residual of residual learning: the proposed components are adopted.

3.2 Experimental Results

To confirm the superiority of the proposed approach, we compare our network with multiple state-of-the-art methods, namely EDSR [18], DUF [10], EDVR [27], RBPN [7], TOFlow [30], and FRVSR [22]. We present the quantitative and qualitative results in Table 1 and Fig. 5, respectively. Our approach outperforms almost all existing methods by a large margin at all scales in terms of CardiacPSNR and CardiacSSIM. In addition, our method yields clearer and more photo-realistic SR results that are subjectively closer to the ground truths. Moreover, the results on the external DSB15SR dataset are sufficiently convincing to validate the generalization of the proposed approach. On the other hand, the comparison of model parameters, FPS, and image quality in the cardiac region plotted in Fig. 4a demonstrates that our method strikes the best balance between efficiency and reconstruction performance.

Fig. 5.

Qualitative results. Zoom in to see better visualization.

3.3 Ablation Study

We adopt the unidirectional ConvLSTM as the simplest baseline. As shown in Table 2, temporal information is important, since model performance degrades when the memory cells in the ConvLSTM are disabled. As cardiac MRI videos are cyclic, we can refresh the memory by feeding n successive frames. Accordingly, we analyze the relation between n and model performance. The result in Fig. 4b shows that the network improves significantly as the number of updated frames increases. Moreover, the forward and backward information proves useful and complementary for recovering the lost details.

In Sect. 2.2, we exploit the knowledge of the cardiac phase to better fuse the bidirectional information. The result in Table 2 reveals that the phase fusion module leverages the bidirectional temporal features more effectively. Besides, we explore the influence of the total number of refinement stages \(\varOmega \) in the residual of residual learning. It can be observed from Fig. 4c that the reconstruction performance improves as the number of refinement stages increases. The likely reason for the saturation or degradation of the overall performance when \(\varOmega \) equals 3 or 4 is overfitting (violating Occam's razor).

4 Conclusion

In this work, we define the cyclic cardiac MRI video super-resolution problem, which, to the best of our knowledge, has not yet been addressed. To tackle this issue, we bring cardiac knowledge into our network and employ the residual of residual learning to train in a progressive refinement manner, which enables the model to generate sharper results with fewer parameters. In addition, we build large-scale datasets and introduce cardiac metrics for this problem. Through extensive experiments, we demonstrate that our network outperforms state-of-the-art baselines both qualitatively and quantitatively. Most notably, we carry out an external evaluation, which indicates that our model exhibits good generalization. We believe our approach can be seamlessly applied to other modalities such as computed tomography angiography and echocardiography.