
1 Introduction

Cardiac magnetic resonance imaging (MRI) is one of the reference modalities for obtaining qualitative and quantitative information on the morphology and function of the heart, which can be used to assess cardiovascular diseases. Both cardiac MR image segmentation and motion estimation are crucial steps for the dynamic exploration of cardiac function, enabling the accurate quantification of regional measures such as changes in ventricular volumes and the elasticity and contractility properties of the myocardium [13]. Traditionally, most approaches treat segmentation and motion estimation as two separate problems. However, the two tasks are known to be closely related [6, 17], and learning meaningful representations for one should help in learning representations for the other.

In this paper, we propose a joint deep learning network that predicts segmentation and motion estimates simultaneously for cardiac MR sequences. The proposed architecture consists of two branches: an unsupervised Siamese style spatial transformer network for cardiac motion estimation, which exploits multi-scale features and recurrent units to accurately predict sequences of motion fields while ensuring spatio-temporal smoothness; and a segmentation branch which takes advantage of the joint feature learning to enable weakly-supervised segmentation from temporally sparse annotated data. We formulate the problem with a composite loss function and optimize both tasks simultaneously. In experiments on cardiac MRI from 220 subjects, we show that the proposed model significantly improves performance on both tasks.

1.1 Related Work

In recent years, many deep learning approaches have been proposed for cardiac MR image segmentation. Most of these employ a fully convolutional network which learns useful features by training on manually annotated images and predicts a pixel-wise label map [2, 3, 4, 10]. However, in real-world applications, usually only the end-diastolic (ED) and end-systolic (ES) frames of a cardiac MR sequence are manually annotated, and the information contained in the other frames is left unexploited by previous works. On the other hand, traditional methods commonly extend classical optical flow or image registration methods to cardiac motion estimation [7, 13, 14, 16]. For instance, De Craene et al. [7] optimized a 4D velocity field parameterized by B-spline spatio-temporal kernels to introduce temporal consistency, and Shi et al. [14] combined different MR sequences to estimate myocardial motion using a series of free-form deformations (FFD) [12]. More recently, deep learning approaches [15, 18] have also been proposed for medical image registration; they either train networks to learn similarity metrics or simulate transformations as ground truth to learn the regression. In contrast, our proposed method is a unified model for learning both cardiac motion estimation and segmentation, where no motion ground truth is required and only temporally sparse annotated frames in a cardiac cycle are needed. Of particular relevance to our approach are works [6, 11] from the computer vision domain: SegFlow [6] used a joint learning framework for natural video object segmentation and optical flow, and the work in [11] propagated labels using the estimated flow to enable weakly-supervised segmentation. In comparison, our method couples the two tasks in a different way, by learning a joint feature encoder that exploits the massive information contained in unlabeled data and the redundancy of the feature representation shared by both tasks.

2 Methods

Our goal is to realize the simultaneous motion estimation and segmentation for cardiac MR image sequences. Here we construct a unified model consisting of two branches: an unsupervised motion estimation branch based on a Siamese style recurrent multi-scale spatial transformer network, and a segmentation branch based on a fully convolutional neural network, where the two branches share a joint feature encoder. The overall architecture of the model is shown in Fig. 1.

Fig. 1.

The overall schematic architecture of the proposed network for joint estimation of cardiac motion and segmentation. (a) The proposed Siamese style multi-scale recurrent motion estimation branch. (b) The segmentation branch, which shares the joint feature encoder with the motion estimation branch. The feature encoder architecture is adopted from the VGG-16 net before the FC layer. Both branches have the same head architecture as the one proposed in [4], and the concatenation layers of the motion estimation branch take the last layers at different scales of the feature encoder. For the detailed architecture, please refer to the supplementary material.

2.1 Unsupervised Cardiac Motion Estimation

Deep learning methods normally rely heavily on labeled ground-truth data. However, for cardiac motion estimation, dense transformation maps between frames are rarely available. Inspired by the success of the spatial transformer network [5, 9, 11], which effectively encodes optical flow to describe motion, we propose a novel Siamese style multi-scale recurrent network that estimates the cardiac motion of MR image sequences without any supervision. A schematic illustration of the model is shown in Fig. 1(a).

The task is to find a sequence of consecutive optical-flow representations between the target frame \(I_{t}\) and the source frames \(I_{t+1}, I_{t+2}, \ldots, I_{t+T}\), where the output is a pixel-wise 2D motion field \(\varDelta \) representing the displacements in the x and y directions. To achieve this, the proposed network consists of four main components: a Siamese network for feature extraction from both the target and source frames; a multi-scale concatenation of features from pairs of frames; a convolutional recurrent unit (RNN) which propagates information along the temporal dimension; and a sampler that warps the source frame to the target one using the estimated displacement field. In detail, inspired by the success of the cardiac segmentation network proposed in [4], we adopt as Siamese feature encoder the network of [4], which is adapted from the VGG-16 net. To combine information from pairs of frames, and motivated by the traditional multi-level registration method [12], we concatenate multi-scale features from both streams of the Siamese network to exploit information at different scales; each scale is followed by a convolution and an upsampling operation back to the original resolution, and the results are combined in a concatenation layer. In addition, to exploit information from consecutive frames and to ensure spatio-temporal smoothness of the estimated motion fields, we incorporate a simple convolutional RNN with a tanh activation at the last layer, which propagates motion information along the temporal dimension and estimates the flow as two feature maps \(\varDelta = (\varDelta x, \varDelta y; \theta _{\varDelta })\) corresponding to displacements in the x and y dimensions, where the network is parameterized by \(\theta _{\varDelta }\). Finally, the source frames \(I_{t+k}\) are warped to the target frame by bilinear interpolation, which can be expressed as \(I_{t+k}^{'}(x,y) = \varGamma \{I_{t+k}(x+\varDelta _{t+k}x, y+\varDelta _{t+k}y)\}\).
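For concreteness, the sampler \(\varGamma \) can be implemented with a differentiable bilinear interpolator. The following is a minimal PyTorch sketch (the paper does not specify a framework; the function and variable names here are our own):

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(source, flow):
    """Warp a source frame to the target frame with a bilinear sampler.

    source: (B, C, H, W) source frame I_{t+k}
    flow:   (B, 2, H, W) estimated displacements (dx, dy) in pixels
    Returns I'_{t+k}(x, y) = I_{t+k}(x + dx, y + dy).
    """
    b, _, h, w = source.shape
    # Identity sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=source.device),
                            torch.arange(w, device=source.device),
                            indexing="ij")
    # Displace the grid by the predicted flow and normalise to [-1, 1],
    # the coordinate convention expected by grid_sample.
    gx = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0
    return F.grid_sample(source, torch.stack((gx, gy), dim=-1),
                         align_corners=True)
```

Because the sampling is differentiable, the reconstruction error between the warped source and the target back-propagates through the estimated flow, which is what allows the motion branch to be trained without ground-truth deformations.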

To train the spatial transformer, we optimize the network by minimizing the pixel-wise mean squared error between the transformed frames and the target frame. To ensure local smoothness, we penalize the gradients of the flow map using the approximation of the Huber loss proposed in [5], namely \(\mathcal {H}(\delta _{x,y}\varDelta _t) = \sqrt{\epsilon +\sum _{i=x,y}((\delta _{x}\varDelta _i)^2+(\delta _{y}\varDelta _i)^2)}\). Similarly, we use a regularization term \(\mathcal {H}(\delta _{t}\varDelta ) = \sqrt{\epsilon +\sum _{i=x,y,t}(\delta _{t}\varDelta _i)^2}\) to constrain the flow to behave smoothly in the temporal dimension, where \(\epsilon =0.01\). The loss function can therefore be described as follows:

$$\begin{aligned} \mathcal {L}_m = \frac{1}{T}\sum _{k=1}^{T}[\Vert I_t-I_{t+k}^{'}\Vert ^2+\alpha \mathcal {H}(\delta _{x,y}\varDelta _{t+k})] + {\beta } \mathcal {H}(\delta _{t}\varDelta ), \end{aligned}$$
(1)

where T is the length of the frame sequence, and \(\alpha \) and \(\beta \) are regularization parameters that trade off image dissimilarity against local and temporal smoothness.
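For illustration, a minimal PyTorch sketch of Eq. (1) using forward finite differences for the gradient terms is given below; the tensor layout and exact discretization are our assumptions, not necessarily those of the original implementation.

```python
import torch

def motion_loss(target, warped_seq, flow_seq, alpha=0.001, beta=0.0001, eps=0.01):
    """Composite motion loss L_m of Eq. (1) (sketch with forward differences).

    target:     (B, 1, H, W)    target frame I_t
    warped_seq: (B, T, 1, H, W) warped source frames I'_{t+k}
    flow_seq:   (B, T, 2, H, W) estimated flows Delta_{t+k}
    """
    T = flow_seq.shape[1]
    loss = 0.0
    for k in range(T):
        flow = flow_seq[:, k]                        # (B, 2, H, W)
        # Pixel-wise MSE between the warped source and the target frame.
        mse = torch.mean((target - warped_seq[:, k]) ** 2)
        # Approximate Huber penalty on spatial flow gradients:
        # sqrt(eps + sum_i ((d_x Delta_i)^2 + (d_y Delta_i)^2)).
        dx = flow[:, :, :, 1:] - flow[:, :, :, :-1]  # (B, 2, H, W-1)
        dy = flow[:, :, 1:, :] - flow[:, :, :-1, :]  # (B, 2, H-1, W)
        sx = (dx ** 2).sum(dim=1)[:, :-1, :]         # crop to a common grid
        sy = (dy ** 2).sum(dim=1)[:, :, :-1]
        smooth = torch.sqrt(eps + sx + sy).mean()
        loss = loss + mse + alpha * smooth
    loss = loss / T
    # Temporal smoothness: penalize frame-to-frame flow differences.
    dt = flow_seq[:, 1:] - flow_seq[:, :-1]          # (B, T-1, 2, H, W)
    loss = loss + beta * torch.sqrt(eps + (dt ** 2).sum(dim=2)).mean()
    return loss
```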

2.2 Joint Model for Cardiac Motion Estimation and Segmentation

Motion estimation and segmentation are closely related tasks, and previous works in the computer vision domain have shown that learning one task can benefit the other [6, 17]. Motivated by the success of self-supervised learning, which learns features from intrinsic, freely available signals [1, 8], we propose a joint learning model for cardiac motion estimation and segmentation, where features learned by unsupervised (or self-supervised) motion estimation are exploited for segmentation. By coupling the motion estimation and segmentation networks, the proposed approach can be viewed as a weakly-supervised method for temporally sparse annotated data, in which motion estimation facilitates feature learning by exploring the unlabeled data. The schematic architecture of the unified model is shown in Fig. 1.

In detail, the proposed joint model consists of two branches: the motion estimation branch proposed in Sect. 2.1, and the segmentation branch based on the effective network proposed in [4]. Both branches share the joint feature encoder (the Siamese style network) as shown in Fig. 1, so that the learned features better capture representations useful for both tasks. A categorical cross-entropy loss \(\mathcal {L}_s = -\sum _{l\in L} y_{l} \log f(x_{l};\varTheta )\) over the labeled data set L is used for the segmentation branch, where \(x_l\) is the input data, \(y_l\) the ground truth, and f the segmentation function parameterized by \(\varTheta \). In addition, to further exploit the unlabeled input data, we add a spatial transformer to the segmentation branch which warps the predicted segmentation to the target frame using the motion fields estimated by the motion estimation branch. Analogously, a categorical cross-entropy loss \(\mathcal {L}_w = -\sum _{n\in U} y_{l} \log f_w(x_{n};\varTheta )\) is applied between the warped segmentations and the annotated target, where U is the unlabeled data set, \(y_l\) the ground truth of the corresponding annotated target frame, and \(f_w\) denotes f followed by the warp operation. This component mainly acts as a regularizer for the motion estimation branch and is expected to improve the estimation around boundaries.
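A minimal sketch of the weak-supervision term \(\mathcal {L}_w\) is given below. It reuses the bilinear sampler sketched in Sect. 2.1 and assumes that the predicted class probabilities of an unlabeled frame are warped to the annotated target frame before the cross-entropy is evaluated; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def weak_supervision_loss(seg_logits, flow_to_target, target_label):
    """Cross-entropy L_w between segmentations of an unlabeled frame warped
    to the annotated target frame and that frame's ground truth y_l.

    seg_logits:     (B, C, H, W) predicted logits f(x_n) on the unlabeled frame
    flow_to_target: (B, 2, H, W) motion field mapping the frame to the target
    target_label:   (B, H, W)    ground-truth class indices on the target frame
    """
    probs = F.softmax(seg_logits, dim=1)
    # Warp each class-probability map with the same bilinear sampler used
    # for the images (warp_source_to_target from the Sect. 2.1 sketch).
    warped = warp_source_to_target(probs, flow_to_target)
    # Categorical cross-entropy against the target frame annotation.
    return F.nll_loss(torch.log(warped.clamp(min=1e-8)), target_label)
```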

As a result, a composite loss function consisting of the image similarity error, the smoothness penalty on the motion fields, and the pixel-wise cross-entropy segmentation losses with the softmax function can be defined as follows:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{m}+\lambda _1\mathcal {L}_{s}+\lambda _2 \mathcal {L}_{w}, \end{aligned}$$
(2)

where \(\lambda _1\) and \(\lambda _2\) are trade-off parameters between the tasks. To initialize the joint model, we first train the motion estimation branch using all available data. We then fix the weights of the shared feature encoder and train the segmentation branch on the annotated data. Finally, we jointly train both branches by minimizing the composite loss function on the training set.
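This three-stage schedule can be sketched as follows, assuming the network is split into three separately parameterized modules (the module names and the use of the Adam optimizer are our assumptions; only the learning rate of 0.0001 is reported in Sect. 3):

```python
import itertools
import torch

def composite_loss(l_m, l_s, l_w, lam1=0.01, lam2=0.001):
    """Eq. (2): L = L_m + lambda_1 * L_s + lambda_2 * L_w."""
    return l_m + lam1 * l_s + lam2 * l_w

def stage_optimizer(stage, encoder, motion_head, seg_head, lr=1e-4):
    """Parameter groups trained at each stage of the schedule.

    Stage 1: motion branch (encoder + motion head), loss L_m, all data.
    Stage 2: segmentation head only, loss L_s, encoder frozen, labeled frames.
    Stage 3: all parameters jointly, composite loss, full training set.
    """
    encoder.requires_grad_(stage != 2)  # freeze the shared encoder in stage 2
    if stage == 1:
        params = itertools.chain(encoder.parameters(), motion_head.parameters())
    elif stage == 2:
        params = seg_head.parameters()
    else:
        params = itertools.chain(encoder.parameters(), motion_head.parameters(),
                                 seg_head.parameters())
    return torch.optim.Adam(params, lr=lr)
```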

3 Experiments and Results

Experiments were performed on 220 short-axis cardiac MR sequences from the UK Biobank study. Each scan contains a sequence of 50 frames, with manual segmentations of the left-ventricular (LV) cavity, the myocardium (Myo) and the right-ventricular (RV) cavity available on the ED and ES frames. A short-axis image stack typically consists of 10 image slices. For pre-processing, all training images were cropped to the same size of \(192\times 192\), and intensities were normalized to the range [0, 1]. We split the data into 100/100/20 subjects for training/testing/validation. The loss function parameters were set to \(\alpha = 0.001\), \(\beta = 0.0001\), \(\lambda _1=0.01\) and \(\lambda _2=0.001\), chosen on the validation set. The sequence length for the RNN during training was \(T=10\), and a learning rate of 0.0001 was used. Data augmentation was performed on-the-fly, with random rotation, translation, and scaling.
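A plausible implementation of these preprocessing and augmentation steps is sketched below; the augmentation parameter ranges are our assumptions, as they are not reported, and segmentation labels would be transformed with nearest-neighbour interpolation (order=0) instead.

```python
import numpy as np
from scipy import ndimage

def preprocess(image, size=192):
    """Centre-crop to size x size and normalise intensities to [0, 1]."""
    h, w = image.shape
    top, left = (h - size) // 2, (w - size) // 2
    crop = image[top:top + size, left:left + size].astype(np.float32)
    lo, hi = crop.min(), crop.max()
    return (crop - lo) / (hi - lo + 1e-8)

def augment(image, rng):
    """Random rotation, translation and scaling applied as a single affine
    warp that preserves the image shape (parameter ranges are assumptions)."""
    angle = np.deg2rad(rng.uniform(-30, 30))
    scale = rng.uniform(0.9, 1.1)
    shift = rng.uniform(-10, 10, size=2)
    c, s = np.cos(angle), np.sin(angle)
    mat = np.array([[c, -s], [s, c]]) / scale   # output-to-input mapping
    centre = np.array(image.shape) / 2.0
    offset = centre - mat @ centre + shift      # rotate/scale about the centre
    return ndimage.affine_transform(image, mat, offset=offset, order=1)
```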

Evaluation was performed with respect to both segmentation and motion estimation. We first evaluated the segmentation performance of the joint model by comparing it with the baseline of training the segmentation branch only (Seg only). Table 1 reports Dice scores computed against the manual annotations of LV, Myo, and RV. The joint model significantly outperforms Seg only on all three structures (\(p \ll 0.001\), Wilcoxon signed-rank test), especially on Myo, where motion normally degrades segmentation accuracy the most. This indicates the merit of joint feature learning: features explored by motion estimation are beneficial for the segmentation task.
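For completeness, the Dice metric in Table 1 can be computed per structure as follows (a standard formulation, not code from the paper):

```python
import numpy as np

def dice(pred, gt, cls):
    """Dice overlap for one structure (LV, Myo or RV), given predicted and
    manual label maps of integer class indices."""
    p, g = (pred == cls), (gt == cls)
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom > 0 else 1.0
```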

Table 1. Evaluation of segmentation accuracy for the proposed joint model and the baseline (Seg only) method in terms of the Dice metric (mean and standard deviation).
Table 2. Evaluation of motion estimation accuracy for FFD, the proposed model in Sect. 2.1 (Motion only) and the proposed joint model in terms of the mean contour distance (MCD) and Hausdorff distance (HD) in mm (mean and standard deviation). The reported time is the testing time for the 50 frames of a cardiac cycle per slice.

We also evaluated the motion estimation performance by comparing results obtained with a B-spline free-form deformation (FFD) algorithm [12], the network proposed in Sect. 2.1 (Motion only), and the joint model proposed in Sect. 2.2. We warped the ES frame segmentations to the ED frame using the estimated motion fields, and computed the mean contour distance (MCD) and Hausdorff distance (HD) between the transformed segmentations and the ED frame segmentations. Table 2 shows the comparison. Both proposed methods outperform the FFD registration method in terms of MCD and HD on all three structures (\(p \ll 0.001\)), and the joint model performs better than the model trained for motion estimation only (\(p \ll 0.001\) on LV and RV, \(p<0.01\) on Myo). Additionally, we compared the test time needed for motion estimation on the 50 frames of a single slice in a cardiac cycle; the results indicate that the proposed methods are considerably faster than FFD.
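The contour-based metrics can be computed as sketched below, assuming sub-pixel contours extracted with scikit-image; the exact MCD/HD definitions used in the paper may differ in detail, and spacing converts pixel units to mm.

```python
import numpy as np
from scipy.spatial.distance import cdist
from skimage import measure

def contour_points(mask):
    """Boundary points of a binary mask as sub-pixel contour coordinates."""
    return np.vstack(measure.find_contours(mask.astype(float), 0.5))

def mcd_hd(warped_mask, target_mask, spacing=1.0):
    """Mean contour distance and Hausdorff distance (in mm) between, e.g.,
    the ES segmentation warped to ED and the manual ED segmentation."""
    a, b = contour_points(warped_mask), contour_points(target_mask)
    d = cdist(a, b) * spacing                  # pairwise point distances
    d_ab, d_ba = d.min(axis=1), d.min(axis=0)  # nearest-point distances
    mcd = 0.5 * (d_ab.mean() + d_ba.mean())    # symmetric mean distance
    hd = max(d_ab.max(), d_ba.max())           # symmetric Hausdorff
    return mcd, hd
```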

Fig. 2.

Visualization of simultaneous motion estimation and segmentation predictions. Myocardial motions are shown from ED to the other time points. Please refer to the supplementary material for a dynamic video of a cardiac cycle.

Furthermore, the proposed joint method predicts a sequence of motion fields and segmentations simultaneously. Fig. 2 visualizes the network predictions, with segmentations and motion fields combined, for frames across a cardiac cycle. The myocardial motion indicated by the yellow arrows is estimated between ED and the other time frames. Note that the network predicts dense motion fields; for better visualization, we only show a sparse representation around the myocardium. To further validate the unified model in terms of motion estimation, Fig. 3(a)(b) shows labeling results for the LV and RV boundaries along the temporal dimension, obtained by warping the segmentation labels available on the ED frame to the other time points, and Fig. 3(c) plots the resulting LV volume over the cardiac cycle. These results show that the proposed model produces an accurate estimation that is also smooth and consistent over time.

Fig. 3.

(a) (b) Labeling results obtained by warping the ED frame segmentation to the other time points using FFD and the proposed joint model. Results are shown as temporal views along the red short-axis line. (c) Left-ventricular volume (ml) of the subject, obtained by warping the ED frame segmentation to the other time points in a cardiac cycle. (Color figure online)

4 Conclusion

In this paper, we have presented a novel deep learning model for joint motion estimation and segmentation of cardiac MR image sequences. The architecture is composed of two branches: an unsupervised Siamese style recurrent spatial transformer network for motion estimation, and a segmentation branch based on a fully convolutional network. A joint feature encoder is shared between the two branches, which enables effective feature learning via multi-task training as well as weakly-supervised segmentation from temporally sparse annotated data. Experimental results showed significant improvements of the proposed model over baseline approaches in terms of both accuracy and speed. In future work, we will validate our method on a larger-scale dataset and investigate its usefulness in 3D applications.