
1 Introduction

Magnetic resonance imaging (MRI) is widely used to assess cardiac function for cardiovascular disease diagnosis. Cardiac motion estimation highlights regional deformation of the myocardium, which is related to the severity of cardiovascular disease. Cardiac motion can be determined from the displacement field in MRI. Moreover, cardiac motion estimation can be regarded as an image registration problem. Shen et al. [8] proposed a spatio-temporal 4D deformable registration method for cardiac motion estimation in MR image sequences. De Craene et al. [3] estimated motion and strain in 3D echocardiography by finding the 4D velocity field with spatio-temporal B-Spline kernels.

In recent years, deep learning-based methods have achieved promising results for deformable registration-based motion characterization. Zheng et al. [10] estimated cardiac motion using a variant of U-Net [7] with a semi-supervised learning strategy. Qin et al. [6] suggested a Siamese style recurrent spatial transformer network for cardiac motion estimation, to guide cardiac segmentation. Both of these works required expert manual segmentation of the left ventricle.

A major challenge is to estimate the effect of cardiac functional changes via automated cardiac motion analysis. The early onset of symptoms already causes an increased strain on the heart, but the strain-related changes are not always easy to see by eye until more significant cardiac structural changes occur. Motion-characteristic features, such as time series of the endocardial radius, thickness, circumferential strain (Ecc) and radial strain (Err) are related to cardiac disease and they are easy to explain as characteristics of pathological cardiac motion. Motion analysis is therefore also useful for early stage characterization of disease.

In this paper, we propose a deep learning-based architecture with a self-supervised strategy to characterise the spatio-temporal patterns of left ventricular (LV) cardiac motion in cardiac MR cine loops for improving the characterization of heart conditions. We compare the proposed method with two other state-of-the-art methods. Specifically, we extract motion-characteristic features and time series of the endocardial radius, thickness, Ecc and Err, based on the output dense displacement field (DDF) of the proposed method, and compare these features between a healthy group and a primary pulmonary hypertension (PPH) pathological group.

Contributions. The contributions of this work are as follows. (1) To our knowledge, this is the first attempt to exploit \(2D +t\) spatio-temporal patterns with convolutional Long Short-Term Memory (ConvLSTM) in LV cardiac motion with a self-supervised strategy. (2) The predicted DDF of this method can be used to determine motion-characteristic features, namely a time series of the endocardial radius, thickness, Ecc and Err. These features are able to characterize different cardiac motion in health and pathologies. (3) We demonstrate that spatio-temporal patterns achieve better performance than the spatial-only pattern for cardiac motion estimation and regional analysis of LV function.

Fig. 1. Network overview. A sequence of image pairs \(\{(I_{0} , I_{t} )\}_{t=1,2,3,...,N}\) is given as input to the U-Net convolutional network. The output of the U-Net, an initial dense displacement field (DDF), is fed to the convolutional LSTMs (ConvLSTM) to update the hidden states. The final output (predicted DDF) is used in subsequent analysis.

2 Spatio-Temporal Network

In this paper, cardiac motion estimation is considered as an image registration problem. The goal then becomes to estimate the spatial transformation of each point in the cardiac structure over the whole cardiac cycle. Let \(\{I_{t}\}_{t=0,1,2,...,N}\) indicate the cardiac MR cine loop frames, where N is the total number of frames. Each pixel-wise point \(x_{0}\) from the end-diastole (ED) frame \(I_{0}\) corresponds to a certain point \(x_{t}\) at the time frame t. In image registration, \(I_{t}(x_t)\) and \(I_{0}(T(x_{0}))\) denote the pixel value at the same physical location. The spatial transformation T is represented by a DDF, described as \(u_{t}\) where \(u_{t}(x_{0}) = x_{t} - x_{0}\).

We model a function \(g_{\theta }(I_{0}, I_{t}) = u_{t}\) using a deep learning architecture, where \(\theta \) denotes the parameters of the architecture. These parameters are learned by optimising a loss function that combines the similarity of the source-target image pair \((I_{0}, I_{t})\) with a spatio-temporal smoothness constraint. We estimate the motion from the ED frame \(I_{0}\) to all other time frames \(I_{t}\), and generate a new image sequence \(\{I^{'}_{t}\}_{t=0,1,2,...,N}\) by warping \(I_{0}\) with the predicted DDFs. The complete pipeline of the proposed architecture is presented in Fig. 1, and is described in Sect. 2.1.
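To make the role of the DDF concrete, the following is a minimal sketch of warping a source image with a displacement field using bilinear sampling. It is written in plain Python for readability and is not the implementation used in this work. The function names (`bilinear_sample`, `warp`), the border clamping, and the backward-warping convention (output pixel x reads the source at x + u(x)) are assumptions of the sketch.

```python
import math

def bilinear_sample(img, x, y):
    """Sample a 2D image (list of rows) at a real-valued location (x, y).

    Coordinates are clamped to the image border (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def warp(src, ddf_x, ddf_y):
    """Backward-warp `src` with a dense displacement field: out(x) = src(x + u(x))."""
    h, w = len(src), len(src[0])
    return [[bilinear_sample(src, c + ddf_x[r][c], r + ddf_y[r][c])
             for c in range(w)] for r in range(h)]

# Toy 3x3 image and a constant displacement of half a pixel along x.
image = [[0.0, 1.0, 2.0],
         [0.0, 1.0, 2.0],
         [0.0, 1.0, 2.0]]
ux = [[0.5] * 3 for _ in range(3)]
uy = [[0.0] * 3 for _ in range(3)]
warped = warp(image, ux, uy)
```

In a deep learning framework the same operation would normally be performed by a differentiable sampling layer so that gradients can flow back to the predicted DDF.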

2.1 Network Architecture

Our deep learning architecture is a combination of a fully convolutional network (FCN) and a recurrent neural network (RNN). We describe the function of the FCN and RNN as follows.

U-Net. The FCN component explores the spatial information in each 2D slice (intra-slice information). U-Net [7] is employed due to its well-known ability to represent image features for biomedical image segmentation. It consists of encoder and decoder parts with skip connections. Details of the U-Net are shown in the middle part of Fig. 1. A sequence of source-target image pairs \(\{(I_{0}, I_{t})\}_{t=1,2,3,...,N}\) is input to the U-Net convolutional network. Each image pair is concatenated into a 2-channel 2D image. The encoder uses blocks of 2D convolutional layers (\(3 \times 3\) kernel size), 2D batch normalization, rectified linear units (ReLU) and 2D max pooling layers (\(2 \times 2\) window size). The decoder uses blocks of transposed 2D convolutional layers (\(2 \times 2\) kernel size), 2D batch normalization and ReLU. The output of the U-Net is an initial dense displacement field (DDF), which is fed to the ConvLSTM layers to update their hidden states.

Convolutional LSTMs. The RNN component learns temporal relationships along the timeline (inter-slice information). We stack multiple convolutional LSTMs (ConvLSTM) [9], in order to increase the likelihood of detecting long-term dependencies of the cardiac motion over the cardiac cycle. We ran our architecture with different numbers of layers and kernel sizes in the ConvLSTM. Based on the validation performance, we stack 2 ConvLSTM layers with a 3-pixel kernel size in each layer. The number of input channels and the number of hidden channels of the ConvLSTM are each 2, where information in one channel represents the displacement in the x direction and in the other represents the displacement in the y direction.

The ConvLSTM can learn which information to keep in the long-term state, which information to drop, and which information to read. We present the details of the LSTM in Fig. 1. Let the current input be \(X_{T}\) and the previous hidden state be \(H_{T-1} \). Then,

$$\begin{aligned} \begin{array}{l} I_{T} = \sigma (W_{XI} * X_{T} + W_{HI} * H_{T-1} + W_{CI} \circ C_{T-1} + B_{I}),\\ F_{T} = \sigma (W_{XF} * X_{T} + W_{HF} * H_{T-1} + W_{CF} \circ C_{T-1} + B_{F}),\\ C_{T} = F_{T} \circ C_{T-1} + I_{T} \circ \tanh ( W_{XC} * X_{T} + W_{HC} * H_{T-1} + B_{C} ),\\ O_{T} = \sigma (W_{XO} * X_{T} + W_{HO} * H_{T-1} + W_{CO} \circ C_{T} + B_{O}),\\ H_{T} = O_{T} \circ \tanh ( C_{T} ). \end{array} \end{aligned}$$

Here \( *\) is the convolution operator and \( \circ \) is the Hadamard (element-wise) product. \(W_{XI}, W_{XF}, W_{XO}, W_{XC}, W_{HI}, W_{HF}, W_{HO}\) and \( W_{HC} \) represent the convolutional filters, and \(W_{CI}, W_{CF}\) and \(W_{CO}\) are the peephole weights applied element-wise to the long-term state. \(B_{I}, B_{F}, B_{O} \) and \(B_{C}\) are the biases of the respective gates. The input gate \( I_{T}\) controls which part of the new input information will be kept in the long-term state. The forget gate \(F_{T}\) decides which part of the long-term state is removed. The output gate \(O_{T}\) decides which part of the long-term state is read. \(C_{T} \) is the long-term state. The short-term state \( H_{T}\) is the motion state in the cardiac MR cine loop frames and constitutes the output, i.e. the predicted DDF.
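The gate equations can be traced step by step with a small plain-Python sketch of a single ConvLSTM update. This is an illustration under simplifying assumptions, not the network used in this work: it handles a single channel (the paper uses 2), uses zero-padded cross-correlation as the "convolution" (as is common in deep learning libraries), and stores parameters in a hypothetical dictionary `p` keyed by the symbol names of the equations.

```python
import math

def conv2d(img, kernel):
    """'Same' 2D cross-correlation with zero padding (single channel, odd kernel)."""
    h, w = len(img), len(img[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(-k, k + 1):
                for j in range(-k, k + 1):
                    rr, cc = r + i, c + j
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i + k][j + k] * img[rr][cc]
            out[r][c] = acc
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def convlstm_step(x, h_prev, c_prev, p):
    """One ConvLSTM update following the five gate equations (single channel).

    `p` maps "W_X*"/"W_H*" to 3x3 kernels, "W_C*" to peephole maps with the
    same shape as the state, and "B_*" to scalar biases.
    """
    rows, cols = len(x), len(x[0])
    # Convolve the input with the W_X* filters and the hidden state with W_H*.
    conv = {name: conv2d(x if name[2] == "X" else h_prev, p[name])
            for name in ("W_XI", "W_HI", "W_XF", "W_HF",
                         "W_XC", "W_HC", "W_XO", "W_HO")}
    h_new = [[0.0] * cols for _ in range(rows)]
    c_new = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            i_gate = sigmoid(conv["W_XI"][r][c] + conv["W_HI"][r][c]
                             + p["W_CI"][r][c] * c_prev[r][c] + p["B_I"])
            f_gate = sigmoid(conv["W_XF"][r][c] + conv["W_HF"][r][c]
                             + p["W_CF"][r][c] * c_prev[r][c] + p["B_F"])
            c_new[r][c] = f_gate * c_prev[r][c] + i_gate * math.tanh(
                conv["W_XC"][r][c] + conv["W_HC"][r][c] + p["B_C"])
            # The output gate peeks at the updated long-term state C_T.
            o_gate = sigmoid(conv["W_XO"][r][c] + conv["W_HO"][r][c]
                             + p["W_CO"][r][c] * c_new[r][c] + p["B_O"])
            h_new[r][c] = o_gate * math.tanh(c_new[r][c])
    return h_new, c_new

# Toy 4x4 example with all weights and biases set to zero (illustrative only).
size = 4
zero_kernel = [[0.0] * 3 for _ in range(3)]
zero_map = [[0.0] * size for _ in range(size)]
params = {name: zero_kernel for name in ("W_XI", "W_HI", "W_XF", "W_HF",
                                         "W_XC", "W_HC", "W_XO", "W_HO")}
params.update({name: zero_map for name in ("W_CI", "W_CF", "W_CO")})
params.update({name: 0.0 for name in ("B_I", "B_F", "B_C", "B_O")})
x = [[1.0] * size for _ in range(size)]
c0 = [[1.0] * size for _ in range(size)]
h1, c1 = convlstm_step(x, zero_map, c0, params)
```

With all parameters at zero every gate evaluates to 0.5, so the long-term state is simply halved; this makes the toy call easy to check by hand against the equations.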

Loss Function. The loss function (L) is defined as the sum of an image intensity-based similarity loss \(L_m\) and a regularisation loss \(L_s\) on the predicted DDF displacements. Namely, \(\varvec{L} = \varvec{L_m} + \varvec{L_s}\). \(\varvec{L_m}\) measures the pixel-wise mean squared error between the registered source image \(I_{t}^{'}\) and the target image \(I_{t}\): \(\varvec{L_m} = \frac{1}{N}\sum _{t=1}^{N}(I_t-I_{t}^{'})^2\). Following the spatial transformer network [4], \(I_{0}\) is transformed to \(I_{t}^{'}\) using bilinear sampling. The second term, \(L_s\), is the spatial and temporal smoothness penalty, which controls the variation of displacements over space and time via an approximated Huber loss [6]. Mathematically, \(\varvec{L_s} = \lambda _{1} \varvec{L_{spatial}} + \lambda _{2} \varvec{L_{temporal}}\), where \(\varvec{L_{spatial}}\) penalises first-order spatial derivatives and \( \varvec{L_{temporal}}\) penalises first-order temporal derivatives of the DDF. \(\lambda _{1} \) and \(\lambda _{2} \) are regularization parameters which are chosen empirically.
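A minimal plain-Python sketch of how the three loss terms could be combined is given below. It is illustrative only. The approximated Huber penalty is assumed here to take the form \(\sqrt{\epsilon + \sum \delta ^2}\) over forward differences with \(\epsilon = 0.01\); the exact form, the averaging over pixels and frames, and all function names are assumptions of the sketch. The default weights are the \(\lambda _{1}\) and \(\lambda _{2}\) values reported in Sect. 3.2.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized 2D images."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b)) / n

def spatial_penalty(ux, uy, eps=0.01):
    """Approximated Huber penalty on first-order spatial differences of one DDF.

    Expects fields of at least 2x2 pixels.
    """
    h, w = len(ux), len(ux[0])
    total = 0.0
    for r in range(h - 1):
        for c in range(w - 1):
            sq = sum((u[r][c + 1] - u[r][c]) ** 2 + (u[r + 1][c] - u[r][c]) ** 2
                     for u in (ux, uy))
            total += math.sqrt(eps + sq)
    return total / ((h - 1) * (w - 1))

def temporal_penalty(ddfs, eps=0.01):
    """Approximated Huber penalty on frame-to-frame differences of the DDFs."""
    if len(ddfs) < 2:
        return 0.0
    h, w = len(ddfs[0][0]), len(ddfs[0][0][0])
    total = 0.0
    for (ux0, uy0), (ux1, uy1) in zip(ddfs, ddfs[1:]):
        for r in range(h):
            for c in range(w):
                sq = (ux1[r][c] - ux0[r][c]) ** 2 + (uy1[r][c] - uy0[r][c]) ** 2
                total += math.sqrt(eps + sq)
    return total / ((len(ddfs) - 1) * h * w)

def total_loss(targets, warped, ddfs, lam1=0.002, lam2=0.0002):
    """L = L_m + lam1 * L_spatial + lam2 * L_temporal over a frame sequence."""
    l_m = sum(mse(t, w) for t, w in zip(targets, warped)) / len(targets)
    l_spatial = sum(spatial_penalty(ux, uy) for ux, uy in ddfs) / len(ddfs)
    l_temporal = temporal_penalty(ddfs)
    return l_m + lam1 * l_spatial + lam2 * l_temporal

# Toy sequence: two frames, perfect registration, zero displacement everywhere.
frame = [[0.0, 1.0], [1.0, 0.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
loss = total_loss([frame, frame], [frame, frame], [(zero, zero), (zero, zero)])
```

Note that with this form of the penalty a perfectly smooth field still contributes \(\sqrt{\epsilon }\) per location, which is a constant offset and does not affect the gradients.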

2.2 The Regional Analysis of Left Ventricular Function

The high-level steps in the regional analysis of LV function are summarised in Fig. 2. The segmentation mask of the ED frame is deformed to each other frame based on the predicted DDF. Automatic post-processing is then applied to identify the LV endocardial and epicardial borders. To smooth the borders of the deformed masks on the mid-slice 6-segment model of the 17-Segment AHA model [2], we perform a morphological closing operation (kernel size = 2) on them.

We divide the resulting predicted myocardium mask into segments based on the 17-Segment Model (AHA). Firstly, we find the barycenters of the LV and the right ventricle (RV) in the middle slice of the short-axis view image. Secondly, we define the straight line between these two points as the initial line. Thirdly, we rotate this initial line around the barycenter of the LV by \(60^\circ \), \(120^\circ \), \(180^\circ \), \(240^\circ \) and \(300^\circ \) to divide the middle slice into 6 segments. Morphological transformations and barycenter location are implemented using OpenCV. The time series of the endocardial radius, thickness, Ecc and Err are measured in these 6 segments. In each segment, the mean and standard deviation are reported to capture the variation within the segment. To this end, we sample all the points on the endocardial border for the endocardial radius, and 5 points (one every \(12^\circ \)) for the thickness and Err. Because the perimeter is small on the end-systolic (ES) frame, we divide the endocardial border of each segment into 3 sets instead of 5 for Ecc.
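The segment assignment described above can be sketched in plain Python as follows. This is a hedged illustration rather than the OpenCV-based implementation used in this work: the function names, the direction of rotation, and the numbering of the sectors (sector 0 starts at the LV-RV line) are assumptions, since the text does not fix them.

```python
import math

def segment_index(point, lv_center, rv_center, n_segments=6):
    """Return the sector (0..n_segments-1) of `point` around the LV barycenter.

    Sector 0 starts at the line from the LV barycenter to the RV barycenter;
    further boundaries follow every 360 / n_segments degrees.
    """
    ref = math.atan2(rv_center[1] - lv_center[1], rv_center[0] - lv_center[0])
    ang = math.atan2(point[1] - lv_center[1], point[0] - lv_center[0])
    rel = math.degrees(ang - ref) % 360.0
    return int(rel // (360.0 / n_segments)) % n_segments

def radius_stats(border, lv_center, rv_center, n_segments=6):
    """Mean and standard deviation of the endocardial radius per sector."""
    groups = [[] for _ in range(n_segments)]
    for p in border:
        radius = math.hypot(p[0] - lv_center[0], p[1] - lv_center[1])
        groups[segment_index(p, lv_center, rv_center, n_segments)].append(radius)
    stats = []
    for g in groups:
        if not g:
            stats.append(None)  # no border point fell into this sector
            continue
        mean = sum(g) / len(g)
        std = math.sqrt(sum((v - mean) ** 2 for v in g) / len(g))
        stats.append((mean, std))
    return stats

# Toy data: a circular endocardial border of radius 10 pixels, one point per degree.
lv, rv = (50.0, 50.0), (80.0, 50.0)
border = [(lv[0] + 10.0 * math.cos(math.radians(a + 0.5)),
           lv[1] + 10.0 * math.sin(math.radians(a + 0.5))) for a in range(360)]
stats = radius_stats(border, lv, rv)
```

In image coordinates the y axis points downwards, so increasing angles in this sketch correspond to a clockwise rotation on the displayed image.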

Fig. 2. Overview of the proposed framework for quantifying cardiac motion. The predicted DDF is applied to deform the segmentation mask of the ED frame from which the regional analysis of left ventricular endocardial radius, thickness, circumferential strain (Ecc) and radial strain (Err) can be estimated.

Strain Computation. Left ventricular strain indicates the deformation of the myocardium over the whole cardiac cycle and is expressed as a percentage. In each time frame T, circumferential strain (Ecc) and radial strain (Err) are computed as \(E = \frac{d_{T} - d_{ED}}{ d_{ED}} \times 100\% \). Here \(d_{ED}\) is the length on the ED frame and \(d_{T}\) is the length on the time frame T. In each sample, we use the arc length of the endocardial border for the Ecc computation and the LV wall thickness for the Err computation.
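As a worked illustration of the strain formula, the short sketch below computes Ecc from the arc length of a sampled border and Err from a wall thickness. The function names and the toy coordinates are hypothetical and chosen only to give round numbers; the arc length is approximated by the length of the polyline through the sampled points.

```python
import math

def strain_percent(d_t, d_ed):
    """Strain E = (d_T - d_ED) / d_ED * 100, in percent."""
    if d_ed <= 0:
        raise ValueError("reference length d_ED must be positive")
    return (d_t - d_ed) / d_ed * 100.0

def polyline_length(points):
    """Arc length approximated by the summed distances between sampled points."""
    return sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy endocardial arc: length 10 pixels at ED, shortened to 8 pixels at frame T.
ed_arc = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
t_arc = [(0.0, 0.0), (2.4, 3.2), (4.8, 6.4)]
ecc = strain_percent(polyline_length(t_arc), polyline_length(ed_arc))

# Toy wall thickness: 8 pixels at ED, thickening to 11 pixels at frame T.
err = strain_percent(11.0, 8.0)
```

In this toy case the shortening arc gives a negative Ecc (about -20%) and the thickening wall gives a positive Err (37.5%), which matches the usual sign convention for contraction.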

3 Experiments

3.1 Data Acquisition

Short-axis view cardiac MR image sequences from the UK Biobank were used in this study. The CMR images were acquired on a 1.5 T scanner (MAGNETOM Aera, Syngo Platform VD13A, Siemens Healthcare, Erlangen, Germany). A stack of balanced steady-state free precession (bSSFP) cine short-axis images, around 12 slices, covers the entire left and right ventricles. The in-plane resolution is \(1.8\times 1.8\) mm\(^{2}\), the slice thickness is 8.0 mm and the slice gap is 2.0 mm. Each sequence contains 50 consecutive time frames per cardiac cycle. We randomly selected image sequences of 450 subjects for training, 47 subjects for validation and 100 subjects for testing.

3.2 Implementation Details

Pre-processing. For training and testing the deep learning architecture, all images were cropped to a size of \(192 \times 192\) pixels because of GPU limitations, and intensity normalisation was applied to the cropped images. The segmentation mask of the LV endocardial and epicardial borders and the right ventricular (RV) endocardial borders at the ED frame was generated using the FCN method proposed by Bai et al. [1] and used to quantify cardiac motion.
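The pre-processing step can be sketched as below. The text does not state where the crop is placed or which normalisation is used, so the centre crop and the min-max scaling to [0, 1] are assumptions made purely for illustration, as are the function names.

```python
def center_crop(img, size=192):
    """Crop a 2D image (list of rows) to size x size around its centre."""
    h, w = len(img), len(img[0])
    if h < size or w < size:
        raise ValueError("image is smaller than the requested crop")
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]

def min_max_normalise(img):
    """Scale intensities linearly to [0, 1]; constant images map to zeros."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [[0.0 for _ in row] for row in img]
    return [[(v - lo) / (hi - lo) for v in row] for row in img]

# Toy 208 x 210 image whose intensity increases along both axes.
image = [[float(r + c) for c in range(210)] for r in range(208)]
cropped = center_crop(image)
normalised = min_max_normalise(cropped)
```

In practice the crop would more likely be centred on the heart than on the image, for example using the segmentation mask; that choice is left open here.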

Training. The model is trained for 150 epochs using Adaptive Moment Estimation (Adam) optimisation [5] with a learning rate of 0.0001 and a batch size of 1. For the smoothness penalty of the loss, we set \(\lambda _{1} \) to 0.002 and \(\lambda _{2}\) to 0.0002 based on algorithm performance on the validation dataset. Further, we randomly select one frame in the selected slice to be frame \(I_{0}\). We set the input image sequence length to 20 frames due to GPU memory limitations. The proposed network was implemented using Python 3.7 with PyTorch. All experiments were run on a GeForce GTX 1080 Ti GPU (10 GB).

3.3 Evaluation Metrics

To quantify the similarity between the predicted image and the target image, we use three image metrics: the normalised root mean-squared error (NRMSE), the mean structural similarity index (MSSIM) and the peak signal-to-noise ratio (PSNR). A two-sided Wilcoxon signed rank test is used to test whether there is a statistically significant difference in these three metrics among the three methods.
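Two of the three metrics are simple enough to sketch in plain Python. The normalisation used for the NRMSE is not specified in the text; the sketch assumes normalisation by the intensity range of the target image, and it assumes a data range of 1.0 for the PSNR (images scaled to [0, 1]). MSSIM is omitted because it requires windowed local statistics. Function names are hypothetical.

```python
import math

def mean_squared_error(pred, target):
    """Mean squared error between two equally sized 2D images."""
    n = len(target) * len(target[0])
    return sum((p - t) ** 2
               for row_p, row_t in zip(pred, target)
               for p, t in zip(row_p, row_t)) / n

def nrmse(pred, target):
    """RMSE divided by the intensity range of the target (assumed convention)."""
    flat = [v for row in target for v in row]
    value_range = max(flat) - min(flat)
    if value_range == 0:
        raise ValueError("target image has zero intensity range")
    return math.sqrt(mean_squared_error(pred, target)) / value_range

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    err = mean_squared_error(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / err)

# Toy example: the prediction is the target plus a constant offset of 0.1.
target = [[0.0, 1.0], [0.5, 0.5]]
pred = [[0.1, 1.1], [0.6, 0.6]]
example_nrmse = nrmse(pred, target)
example_psnr = psnr(pred, target)
```

For the toy pair the RMSE is 0.1 and the target range is 1, so the NRMSE is about 0.1 and the PSNR is about 20 dB; lower NRMSE and higher PSNR indicate a closer match.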

4 Results

4.1 Quantitative Results

Table 1 summarizes the comparative results on the MRI sequences and the ES frame between the proposed and other methods. The proposed method outperforms both Qin et al.’s method [6] and U-Net [7]. It achieves an NRMSE of \(0.053 \pm 0.017\), an MSSIM of \( 0.851 \pm 0.049\), and a PSNR of \(35.391 \pm 2.976\) on the MRI sequences, and an NRMSE of \(0.065 \pm 0.012\), an MSSIM of \(0.836 \pm 0.036\), and a PSNR of \(33.399 \pm 1.120\) on the ES frame. U-Net yielded the lowest MSSIM and PSNR values and the highest NRMSE value on both the MRI sequences and the ES frame among the evaluated approaches. Using a two-sided Wilcoxon signed rank test, the results of the proposed method were statistically significantly better than those of Qin et al.’s method and U-Net (\(p < 0.05\)) for all the measurements.

Table 1. Quantitative comparison on the MRI sequences and the ES frame between our method and two other methods, Qin et al.’s [6] and U-Net [7]. The results are presented as mean ± standard deviation. The best performance is indicated in bold. The \(\star \) indicates that the results of our method are statistically significantly better (p < 0.05) than those of the other methods using a two-sided Wilcoxon signed rank test.

Fig. 3. Cardiac motion estimation comparison on the ES frame of the MRI sequences between (top row to bottom row) the proposed method, Qin et al.’s method [6] and U-Net [7].

4.2 Representative Examples

Cardiac Motion Estimation. Figure 3 shows an example cardiac motion estimation comparison on the 19th frame (ES) of the MRI sequence between the proposed method, Qin et al.’s method [6] and U-Net [7], using spatial-only patterns. The proposed method provides a higher MSSIM (0.853) and PSNR (32.556) and a lower NRMSE (0.090) than the other methods on the predicted ES frame. The displacement image visualizes the DDF. Different colours describe the different motion directions, and the colour intensity expresses the magnitude of the displacement. The proposed method estimates larger displacements (visualised as a stronger colour in Fig. 3, middle column) than the other methods, especially at the centre area of the LV blood pool. The U-Net appears to be less accurate, because it shows strong background noise (shown in green) compared to the proposed method and Qin et al.’s method. The displacement error maps show that the U-Net has the largest difference at the LV and the surrounding area, followed by the method of Qin et al.

Left Ventricular Function Evaluation. Our dataset does not include manual image segmentations. To perform regional analysis of LV function, we ran Bai et al.’s algorithm [1] to obtain the segmented ED frame, and then warped the segmented ED frame to the other frames in the sequence. Table 2 and Fig. 4 show an example of a healthy volunteer and a primary pulmonary hypertension (PPH) patient analysed with the proposed method. Figure 4 shows the time series of the endocardial radius, thickness, Err and Ecc in the six segments of the myocardium estimated for the healthy volunteer and the PPH patient. Compared to the healthy volunteer, the LV of the PPH patient contracts poorly over the whole cardiac cycle, and as a result, the endocardial radius of the PPH patient is larger than that of the healthy volunteer. For instance, the endocardial radius (orange) of segment 1 contracts less. Table 2 shows that on the 19th frame (ES), the mean radius of segment 1 is 10.69 pixels for the PPH patient, while it is 9.79 pixels for the healthy volunteer. In clinical practice, the endocardial radius should take on its smallest value over the cardiac cycle on the ES frame, because the volume of the LV blood pool reaches its minimum then. Moreover, the LV wall thickness in all six segments is smaller for the PPH patient than for the healthy volunteer. Due to the reduced thickness, we conclude that this left ventricle exhibits atrophy.

Table 2. Example results of the peak mean value on the ES frame of the motion-characteristic features, i.e. the time series of the endocardial radius (Endo radius), thickness, circumferential strain (Ecc) and radial strain (Err), for cardiac segments (Seg) (0–5) over a cardiac cycle for a healthy volunteer and a primary pulmonary hypertension (PPH) patient using the proposed method.

Fig. 4. Example results of estimated endocardial radius, thickness, radial strain and circumferential strain (mean and standard deviation shown for each) for cardiac segments (0–5) plotted over a cardiac cycle. Myocardial segment notation (top); and results for a healthy volunteer (left column), and a primary pulmonary hypertension patient (right column).

5 Discussion

In this work we have proposed a deep learning-based approach to cardiac MR motion analysis that uses a self-supervised paradigm to learn spatio-temporal features in cardiac MR cine loops. The results show the ability of the proposed approach to capture spatio-temporal patterns and predict a dense displacement field (DDF) over a full cardiac cycle. The proposed method has higher accuracy than the method of Qin et al. and U-Net, which we attribute to the use of spatio-temporal features. According to our experiments, the best DDF results are obtained when we stack 2 ConvLSTM layers with a 3-pixel kernel size in each layer.

The predicted DDF is employed to deform an ED myocardium mask to other frames and perform regional LV endocardial radius, thickness, Ecc and Err time-series analysis. The results show the potential of the proposed approach to evaluate the clinical parameters for cardiovascular diseases. Currently, we do not use interpolation to smooth feature time series. In our experiments, we find that it is not necessary to smooth the curve. We can use the unsmoothed curve of the endocardial radius to explain the abnormal motion phenomenon in the PPH pathological group.

This work has some limitations. The UK Biobank consists mainly of healthy volunteers and contains only a small number of PPH patients. The model may therefore not represent well the motion and strain patterns typically seen in PPH patients.

6 Conclusion

We present a novel spatio-temporal network to characterise cardiac motion, visualise the dense displacement field and explain motion-characteristic features in a healthy group and a pathological group. The model learns meaningful spatio-temporal patterns of the cardiac motion that can be used for LV regional function analysis. Future work will extend this method to analyse the basal, mid-cavity and apical slices of the LV. The motion and strain analysis method is not disease-specific and could be extended to other cardiac conditions such as ischaemic heart disease, assuming suitable training examples are available.