
1 Introduction

Cardiac magnetic resonance imaging (MRI) is one of the reference modalities for obtaining qualitative and quantitative information on the morphology and function of the heart, which can be used to assess cardiovascular diseases. Both cardiac MR image segmentation and motion estimation are crucial steps for the dynamic exploration of cardiac function, enabling the accurate quantification of regional measures such as changes in ventricular volumes and the elasticity and contractility properties of the myocardium [13]. Traditionally, most approaches treat segmentation and motion estimation as two separate problems. However, the two tasks are known to be closely related [6, 17], and learning meaningful representations for one should help in learning representations for the other.

In this paper, we propose a joint deep learning network that predicts segmentation and motion estimates simultaneously for cardiac MR sequences. The proposed architecture consists of two branches: an unsupervised Siamese style spatial transformer network for cardiac motion estimation, which exploits multi-scale features and recurrent units to accurately predict sequences of motion fields while ensuring spatio-temporal smoothness; and a segmentation branch which takes advantage of the joint feature learning to enable weakly-supervised segmentation from temporally sparse annotated data. We formulate the problem with a composite loss function and optimize both tasks simultaneously. In experiments on cardiac MRI from 220 subjects, we show that the proposed model significantly improves performance on both tasks.

1.1 Related Work

In recent years, many deep learning approaches have been proposed for cardiac MR image segmentation. Most of these employ a fully convolutional network which learns useful features by training on manually annotated images and predicts a pixel-wise label map [2, 3, 4, 10]. However, in real-world applications, usually only the end-diastolic (ED) and end-systolic (ES) frames of a cardiac MR sequence are manually annotated, and the information contained in the other frames is left unexploited by previous works. On the other hand, traditional methods commonly extend classical optical flow or image registration methods to cardiac motion estimation [7, 13, 14, 16]. For instance, De Craene et al. [7] optimized a 4D velocity field parameterized by B-spline spatio-temporal kernels to introduce temporal consistency, and Shi et al. [14] combined different MR sequences to estimate myocardial motion using a series of free-form deformations (FFD) [12]. More recently, deep learning approaches [15, 18] have also been proposed for medical image registration; they either train networks to learn similarity metrics or simulate transformations as ground truth to learn the regression. In contrast, our proposed method is a unified model for learning both cardiac motion estimation and segmentation, where no motion ground truth is required and only temporally sparse annotated frames in a cardiac cycle are needed. Of particular relevance to our approach are works [6, 11] from the computer vision domain: SegFlow [6] used a joint learning framework for natural video object segmentation and optical flow, and the work in [11] propagated labels using the estimated flow to enable weakly-supervised segmentation. In comparison, our method couples the two tasks in a different way, by learning a joint feature encoder that exploits the massive information contained in unlabeled data and the redundancy of the feature representation shared by both tasks.

2 Methods

Our goal is to realize the simultaneous motion estimation and segmentation for cardiac MR image sequences. Here we construct a unified model consisting of two branches: an unsupervised motion estimation branch based on a Siamese style recurrent multi-scale spatial transformer network, and a segmentation branch based on a fully convolutional neural network, where the two branches share a joint feature encoder. The overall architecture of the model is shown in Fig. 1.

Fig. 1.

The overall schematic architecture of the proposed network for joint estimation of cardiac motion and segmentation. (a) The proposed Siamese style multi-scale recurrent motion estimation branch. (b) The segmentation branch, which shares the joint feature encoder with the motion estimation branch. The feature encoder architecture is adopted from the VGG-16 net before the FC layer. Both branches have the same head architecture as the one proposed in [4], and the concatenation layers of the motion estimation branch take the last layers at different scales of the feature encoder. For the detailed architecture, please refer to the supplementary material.

2.1 Unsupervised Cardiac Motion Estimation

Deep learning methods normally rely heavily on labeled ground-truth data. However, for cardiac motion estimation, dense transformation maps between frames are rarely available. Inspired by the success of the spatial transformer network [5, 9, 11], which effectively encodes optical flow to describe motion, we propose a novel Siamese style multi-scale recurrent network that estimates the cardiac motion of MR image sequences without any supervision. A schematic illustration of the model is shown in Fig. 1(a).

The task is to find a sequence of consecutive optical-flow representations between the target frame \(I_{t}\) and the source frames \(I_{t+1}, I_{t+2}, \ldots, I_{t+T}\), where the output is a pixel-wise 2D motion field \(\varDelta \) representing the displacements in the x and y directions. To achieve this, the proposed network consists of four main components: a Siamese network for feature extraction from both the target and source frames; a multi-scale concatenation of features from pairs of frames; a convolutional recurrent unit (RNN) which propagates information along the temporal dimension; and a sampler that warps the source frame to the target one using the estimated displacement field. In detail, inspired by the success of the cardiac segmentation network proposed in [4], we adopt as Siamese feature encoder the network of [4], which is adapted from the VGG-16 net. To combine information from pairs of frames, and motivated by the traditional multi-level registration method [12], we concatenate multi-scale features from both streams of the Siamese network to exploit information at different scales; each scale is followed by a convolution and an upsampling operation back to the original resolution, and the results are combined in a concatenation layer. In addition, to exploit information from consecutive frames and to ensure spatio-temporal smoothness of the estimated motion fields, we incorporate a simple convolutional RNN with a tanh activation at the last layer, which propagates motion information along the temporal dimension and estimates the flow as two feature maps \(\varDelta = (\varDelta x, \varDelta y; \theta _{\varDelta })\) corresponding to displacements in the x and y dimensions, where the network is parameterized by \(\theta _{\varDelta }\). Finally, the source frames \(I_{t+k}\) are warped to the target frame by bilinear interpolation, which can be expressed as \(I_{t+k}^{'}(x,y) = \varGamma \{I_{t+k}(x+\varDelta _{t+k}x, y+\varDelta _{t+k}y)\}\).
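For concreteness, the sampler \(\varGamma \) can be implemented with a differentiable bilinear interpolator. The following is a minimal PyTorch sketch (the paper does not specify a framework; the function and variable names here are our own):

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(source, flow):
    """Warp a source frame to the target frame with a bilinear sampler.

    source: (B, C, H, W) source frame I_{t+k}
    flow:   (B, 2, H, W) estimated displacements (dx, dy) in pixels
    Returns I'_{t+k}(x, y) = I_{t+k}(x + dx, y + dy).
    """
    b, _, h, w = source.shape
    # Identity sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=source.device),
                            torch.arange(w, device=source.device),
                            indexing="ij")
    # Displace the grid by the predicted flow and normalise to [-1, 1],
    # the coordinate convention expected by grid_sample.
    gx = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0
    return F.grid_sample(source, torch.stack((gx, gy), dim=-1),
                         align_corners=True)
```

Because the sampling is differentiable, the reconstruction error between the warped source and the target back-propagates through the estimated flow, which is what allows the motion branch to be trained without ground-truth deformations.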

To train the spatial transformer, we optimize the network by minimizing the pixel-wise mean squared error between the transformed frames and the target frame. To ensure local smoothness, we penalize the gradients of the flow map using the approximation of the Huber loss proposed in [5], namely \(\mathcal {H}(\delta _{x,y}\varDelta _t) = \sqrt{\epsilon +\sum _{i=x,y}((\delta _{x}\varDelta _i)^2+(\delta _{y}\varDelta _i)^2)}\). Similarly, we use a regularization term \(\mathcal {H}(\delta _{t}\varDelta ) = \sqrt{\epsilon +\sum _{i=x,y,t}(\delta _{t}\varDelta _i)^2}\) to constrain the flow to behave smoothly in the temporal dimension, where \(\epsilon =0.01\). The loss function can therefore be described as follows:

$$\begin{aligned} \mathcal {L}_m = \frac{1}{T}\sum _{k=1}^{T}[\Vert I_t-I_{t+k}^{'}\Vert ^2+\alpha \mathcal {H}(\delta _{x,y}\varDelta _{t+k})] + {\beta } \mathcal {H}(\delta _{t}\varDelta ), \end{aligned}$$
(1)

where T is the length of the frame sequence, and \(\alpha \) and \(\beta \) are regularization parameters that trade off image dissimilarity against local and temporal smoothness.
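For illustration, a minimal PyTorch sketch of Eq. (1) using forward finite differences for the gradient terms is given below; the tensor layout and exact discretization are our assumptions, not necessarily those of the original implementation.

```python
import torch

def motion_loss(target, warped_seq, flow_seq, alpha=0.001, beta=0.0001, eps=0.01):
    """Composite motion loss L_m of Eq. (1) (sketch with forward differences).

    target:     (B, 1, H, W)    target frame I_t
    warped_seq: (B, T, 1, H, W) warped source frames I'_{t+k}
    flow_seq:   (B, T, 2, H, W) estimated flows Delta_{t+k}
    """
    T = flow_seq.shape[1]
    loss = 0.0
    for k in range(T):
        flow = flow_seq[:, k]                        # (B, 2, H, W)
        # Pixel-wise MSE between the warped source and the target frame.
        mse = torch.mean((target - warped_seq[:, k]) ** 2)
        # Approximate Huber penalty on spatial flow gradients:
        # sqrt(eps + sum_i ((d_x Delta_i)^2 + (d_y Delta_i)^2)).
        dx = flow[:, :, :, 1:] - flow[:, :, :, :-1]  # (B, 2, H, W-1)
        dy = flow[:, :, 1:, :] - flow[:, :, :-1, :]  # (B, 2, H-1, W)
        sx = (dx ** 2).sum(dim=1)[:, :-1, :]         # crop to a common grid
        sy = (dy ** 2).sum(dim=1)[:, :, :-1]
        smooth = torch.sqrt(eps + sx + sy).mean()
        loss = loss + mse + alpha * smooth
    loss = loss / T
    # Temporal smoothness: penalize frame-to-frame flow differences.
    dt = flow_seq[:, 1:] - flow_seq[:, :-1]          # (B, T-1, 2, H, W)
    loss = loss + beta * torch.sqrt(eps + (dt ** 2).sum(dim=2)).mean()
    return loss
```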

2.2 Joint Model for Cardiac Motion Estimation and Segmentation

Motion estimation and segmentation are closely related tasks, and previous works in the computer vision domain have shown that learning one task can benefit the other [6, 17]. Motivated by the success of self-supervised learning, which learns features from intrinsic, freely available signals [1, 8], we propose a joint learning model for cardiac motion estimation and segmentation, where features learned by unsupervised (or self-supervised) motion estimation are exploited for segmentation. By coupling the motion estimation and segmentation networks, the proposed approach can be viewed as a weakly-supervised method for temporally sparse annotated data, in which motion estimation facilitates feature learning by exploring the unlabeled data. The schematic architecture of the unified model is shown in Fig. 1.

In detail, the proposed joint model consists of two branches: the motion estimation branch proposed in Sect. 2.1, and the segmentation branch based on the effective network proposed in [4]. Both branches share the joint feature encoder (the Siamese style network) as shown in Fig. 1, so that the learned features better capture representations useful for both tasks. A categorical cross-entropy loss \(\mathcal {L}_s = -\sum _{l\in L} y_{l} \log f(x_{l};\varTheta )\) over the labeled data set L is used for the segmentation branch, where \(x_l\) is the input data, \(y_l\) the ground truth, and f the segmentation function parameterized by \(\varTheta \). In addition, to further exploit the unlabeled input data, we add a spatial transformer to the segmentation branch which warps the predicted segmentation to the target frame using the motion fields estimated by the motion estimation branch. Analogously, a categorical cross-entropy loss \(\mathcal {L}_w = -\sum _{n\in U} y_{l} \log f_w(x_{n};\varTheta )\) is applied between the warped segmentations and the annotated target, where U is the unlabeled data set, \(y_l\) the ground truth of the corresponding annotated target frame, and \(f_w\) denotes f followed by the warp operation. This component mainly acts as a regularizer for the motion estimation branch and is expected to improve the estimation around boundaries.
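A minimal sketch of the weak-supervision term \(\mathcal {L}_w\) is given below. It reuses the bilinear sampler sketched in Sect. 2.1 and assumes that the predicted class probabilities of an unlabeled frame are warped to the annotated target frame before the cross-entropy is evaluated; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def weak_supervision_loss(seg_logits, flow_to_target, target_label):
    """Cross-entropy L_w between segmentations of an unlabeled frame warped
    to the annotated target frame and that frame's ground truth y_l.

    seg_logits:     (B, C, H, W) predicted logits f(x_n) on the unlabeled frame
    flow_to_target: (B, 2, H, W) motion field mapping the frame to the target
    target_label:   (B, H, W)    ground-truth class indices on the target frame
    """
    probs = F.softmax(seg_logits, dim=1)
    # Warp each class-probability map with the same bilinear sampler used
    # for the images (warp_source_to_target from the Sect. 2.1 sketch).
    warped = warp_source_to_target(probs, flow_to_target)
    # Categorical cross-entropy against the target frame annotation.
    return F.nll_loss(torch.log(warped.clamp(min=1e-8)), target_label)
```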

As a result, a composite loss function consisting of the image similarity error, the smoothness penalty on the motion fields, and the pixel-wise cross-entropy segmentation losses with the softmax function can be defined as follows:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{m}+\lambda _1\mathcal {L}_{s}+\lambda _2 \mathcal {L}_{w}, \end{aligned}$$
(2)

where \(\lambda _1\) and \(\lambda _2\) are trade-off parameters between the tasks. To initialize the joint model, we first train the motion estimation branch using all available data. We then fix the weights of the shared feature encoder and train the segmentation branch on the annotated data. Finally, we jointly train both branches by minimizing the composite loss function on the training set.
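This three-stage schedule can be sketched as follows, assuming the network is split into three separately parameterized modules (the module names and the use of the Adam optimizer are our assumptions; only the learning rate of 0.0001 is reported in Sect. 3):

```python
import itertools
import torch

def composite_loss(l_m, l_s, l_w, lam1=0.01, lam2=0.001):
    """Eq. (2): L = L_m + lambda_1 * L_s + lambda_2 * L_w."""
    return l_m + lam1 * l_s + lam2 * l_w

def stage_optimizer(stage, encoder, motion_head, seg_head, lr=1e-4):
    """Parameter groups trained at each stage of the schedule.

    Stage 1: motion branch (encoder + motion head), loss L_m, all data.
    Stage 2: segmentation head only, loss L_s, encoder frozen, labeled frames.
    Stage 3: all parameters jointly, composite loss, full training set.
    """
    encoder.requires_grad_(stage != 2)  # freeze the shared encoder in stage 2
    if stage == 1:
        params = itertools.chain(encoder.parameters(), motion_head.parameters())
    elif stage == 2:
        params = seg_head.parameters()
    else:
        params = itertools.chain(encoder.parameters(), motion_head.parameters(),
                                 seg_head.parameters())
    return torch.optim.Adam(params, lr=lr)
```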

3 Experiments and Results

Experiments were performed on 220 short-axis cardiac MR sequences from the UK Biobank study. Each scan contains a sequence of 50 frames, with manual segmentations of the left-ventricular (LV) cavity, the myocardium (Myo) and the right-ventricular (RV) cavity available on the ED and ES frames. A short-axis image stack typically consists of 10 image slices. For pre-processing, all training images were cropped to the same size of \(192\times 192\), and intensities were normalized to the range [0, 1]. We split the data into 100/100/20 subjects for training/testing/validation. The loss function parameters were set to \(\alpha = 0.001\), \(\beta = 0.0001\), \(\lambda _1=0.01\) and \(\lambda _2=0.001\), chosen on the validation set. The sequence length for the RNN during training was \(T=10\), and a learning rate of 0.0001 was used. Data augmentation was performed on-the-fly, with random rotation, translation, and scaling.
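A plausible implementation of these preprocessing and augmentation steps is sketched below; the augmentation parameter ranges are our assumptions, as they are not reported, and segmentation labels would be transformed with nearest-neighbour interpolation (order=0) instead.

```python
import numpy as np
from scipy import ndimage

def preprocess(image, size=192):
    """Centre-crop to size x size and normalise intensities to [0, 1]."""
    h, w = image.shape
    top, left = (h - size) // 2, (w - size) // 2
    crop = image[top:top + size, left:left + size].astype(np.float32)
    lo, hi = crop.min(), crop.max()
    return (crop - lo) / (hi - lo + 1e-8)

def augment(image, rng):
    """Random rotation, translation and scaling applied as a single affine
    warp that preserves the image shape (parameter ranges are assumptions)."""
    angle = np.deg2rad(rng.uniform(-30, 30))
    scale = rng.uniform(0.9, 1.1)
    shift = rng.uniform(-10, 10, size=2)
    c, s = np.cos(angle), np.sin(angle)
    mat = np.array([[c, -s], [s, c]]) / scale   # output-to-input mapping
    centre = np.array(image.shape) / 2.0
    offset = centre - mat @ centre + shift      # rotate/scale about the centre
    return ndimage.affine_transform(image, mat, offset=offset, order=1)
```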

Evaluation was performed with respect to both segmentation and motion estimation. We first evaluated the segmentation performance of the joint model by comparing it with the baseline of training the segmentation branch only (Seg only). Table 1 reports Dice scores computed against the manual annotations of LV, Myo, and RV. The joint model significantly outperforms Seg only on all three structures (\(p \ll 0.001\), Wilcoxon signed-rank test), especially on Myo, where motion normally degrades segmentation accuracy the most. This indicates the merit of joint feature learning: features explored by motion estimation are beneficial for the segmentation task.
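For completeness, the Dice metric in Table 1 can be computed per structure as follows (a standard formulation, not code from the paper):

```python
import numpy as np

def dice(pred, gt, cls):
    """Dice overlap for one structure (LV, Myo or RV), given predicted and
    manual label maps of integer class indices."""
    p, g = (pred == cls), (gt == cls)
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom > 0 else 1.0
```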

Table 1. Evaluation of segmentation accuracy for the proposed joint model and the baseline (Seg only) method in terms of the Dice metric (mean and standard deviation).
Table 2. Evaluation of motion estimation accuracy for FFD, the proposed model in Sect. 2.1 (Motion only) and the proposed joint model in terms of the mean contour distance (MCD) and Hausdorff distance (HD) in mm (mean and standard deviation). The reported time is the testing time for the 50 frames of a cardiac cycle per slice.

We also evaluated the motion estimation performance by comparing results obtained with a B-spline free-form deformation (FFD) algorithm [12], the network proposed in Sect. 2.1 (Motion only), and the joint model proposed in Sect. 2.2. We warped the ES frame segmentations to the ED frame using the estimated motion fields, and computed the mean contour distance (MCD) and Hausdorff distance (HD) between the transformed segmentations and the ED frame segmentations. Table 2 shows the comparison. Both proposed methods outperform the FFD registration method in terms of MCD and HD on all three structures (\(p \ll 0.001\)), and the joint model performs better than the model trained for motion estimation only (\(p \ll 0.001\) on LV and RV, \(p<0.01\) on Myo). Additionally, we compared the test time needed for motion estimation on the 50 frames of a single slice in a cardiac cycle; the results indicate that the proposed methods are considerably faster than FFD.
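The contour-based metrics can be computed as sketched below, assuming sub-pixel contours extracted with scikit-image; the exact MCD/HD definitions used in the paper may differ in detail, and spacing converts pixel units to mm.

```python
import numpy as np
from scipy.spatial.distance import cdist
from skimage import measure

def contour_points(mask):
    """Boundary points of a binary mask as sub-pixel contour coordinates."""
    return np.vstack(measure.find_contours(mask.astype(float), 0.5))

def mcd_hd(warped_mask, target_mask, spacing=1.0):
    """Mean contour distance and Hausdorff distance (in mm) between, e.g.,
    the ES segmentation warped to ED and the manual ED segmentation."""
    a, b = contour_points(warped_mask), contour_points(target_mask)
    d = cdist(a, b) * spacing                  # pairwise point distances
    d_ab, d_ba = d.min(axis=1), d.min(axis=0)  # nearest-point distances
    mcd = 0.5 * (d_ab.mean() + d_ba.mean())    # symmetric mean distance
    hd = max(d_ab.max(), d_ba.max())           # symmetric Hausdorff
    return mcd, hd
```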

Fig. 2.

Visualization of simultaneous motion estimation and segmentation predictions. Myocardial motions are shown from ED to the other time points. Please refer to the supplementary material for a dynamic video of a cardiac cycle.

Furthermore, the proposed joint method predicts a sequence of motion fields and segmentations simultaneously. Fig. 2 visualizes the network predictions, with segmentations and motion fields combined, for frames across a cardiac cycle. The myocardial motion indicated by the yellow arrows is estimated between ED and the other time frames. Note that the network predicts dense motion fields; for better visualization, we only show a sparse representation around the myocardium. To further validate the unified model in terms of motion estimation, Fig. 3(a)(b) shows labeling results for the LV and RV boundaries along the temporal dimension, obtained by warping the segmentation labels available on the ED frame to the other time points, and Fig. 3(c) plots the resulting LV volume over the cardiac cycle. These results show that the proposed model produces an accurate estimation that is also smooth and consistent over time.

Fig. 3.

(a) (b) Labeling results obtained by warping the ED frame segmentation to the other time points using FFD and the proposed joint model. Results are shown as temporal views along the red short-axis line. (c) Left-ventricular volume (ml) of the subject, obtained by warping the ED frame segmentation to the other time points in a cardiac cycle. (Color figure online)

4 Conclusion

In this paper, we have presented a novel deep learning model for joint motion estimation and segmentation of cardiac MR image sequences. The architecture is composed of two branches: an unsupervised Siamese style recurrent spatial transformer network for motion estimation, and a segmentation branch based on a fully convolutional network. A joint feature encoder is shared between the two branches, which enables effective feature learning via multi-task training as well as weakly-supervised segmentation from temporally sparse annotated data. Experimental results showed significant improvements of the proposed model over baseline approaches in terms of both accuracy and speed. In future work, we will validate our method on a larger-scale dataset and investigate its usefulness in 3D applications.