1 Introduction

Reductions in myocardial blood flow due to coronary artery disease (CAD) can result in myocardial ischemia or infarction and subsequent regional myocardial dysfunction. Echocardiography provides a non-invasive and cost-efficient tool for clinicians to visually analyze left ventricular (LV) wall motion and assess regional dysfunction in the myocardium. However, such qualitative analysis is subjective by nature and, as a result, prone to high intra- and interobserver variability.

Motion estimation algorithms provide an objective method for characterizing myocardial contractile function, while segmentation algorithms assist in localizing regions of ischemic tissue. Traditional approaches treat these tasks as unique steps and solve them separately, but recent efforts in the medical image analysis and computer vision fields suggest that these two tasks may be mutually beneficial when optimized simultaneously [1,2,3].

In this paper, we propose a multi-task deep learning network to simultaneously segment the LV and estimate its motion between time frames in a 3D echocardiographic (3DE) sequence. The main contributions of this work are as follows: 1) We introduce a novel multi-task learning architecture with residual blocks that solves both 3D motion estimation and volumetric segmentation using a weight-sharing feature encoder with task-specific decoding branches; 2) We incorporate anatomically inspired constraints to encourage realistic cardiac motion estimation; 3) We apply our proposed model to 3DE sequences, which are typically more challenging than Magnetic Resonance (MR) and Computed Tomography (CT) images due to their lower signal-to-noise ratio, and more challenging than 2DE due to their higher dimensionality and lower spatial and temporal resolution.

2 Related Works

Classic model-based segmentation methods such as active shape and appearance models usually require a priori knowledge or large amounts of feature engineering to achieve adequate results [4, 5]. In recent years, data-driven deep learning approaches have shown promising results, but they still face challenges due to inherent ultrasound image properties such as low signal-to-noise ratio and low image contrast [4, 6]. This becomes increasingly problematic because cardiac motion estimation approaches often rely on accurate segmentations to act as anatomical guides for surface and shape tracking or for the placement of deformable grid points [7, 8]; errors in the segmentation predictions propagate to the motion estimates. While it is conceivable that an expert clinician could manually segment or adjust algorithm predictions prior to the motion estimation step, this is a tedious workaround that is infeasible in practice. Several deep learning approaches have successfully estimated motion in the computer vision field, but the difficulty of obtaining true LV motion in clinical data makes supervised approaches challenging. Unsupervised approaches that seek to maximize intensity similarity between image pairs have been successful in MR and CT; however, applications in 3DE remain limited [9,10,11,12].

In recent years, efforts have been made to combine the tasks of motion estimation and segmentation. Qin et al. [2] proposed a Siamese-style joint encoder network using a VGG-16 based architecture that demonstrates promising results when applied to 2D MR cardiac images. The work in [13] adapts this idea to 2D echocardiography by adopting a feature bridging framework [1] with anatomically inspired constraints. This is further expanded to 3DE in [3] through an iterative training approach in which results from one task influence the training of the other [14]. In this work, we propose an alternative novel framework for combining motion estimation and segmentation in 3DE that uses a shared feature encoder to exploit complementary latent representations in the data, which the iterative style of [3] is not capable of doing. In addition to estimating 3D motion and predicting volumetric LV segmentations, our model further differs from [2] in its use of a 3D U-Net-style architecture [15] with residual blocks [21] for the encoding and decoding branches, as opposed to VGG-16 and FCN architectures (Fig. 1).

Fig. 1. The proposed network and its components. (A) Motion estimation and segmentation tasks are coupled in a multi-task learning framework. (B) An overview of the residual block.
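
As a point of reference for the residual block in Fig. 1(B), below is a minimal PyTorch sketch; the channel count, normalization placement, and activation choices are illustrative assumptions rather than the paper's exact configuration (group normalization is used here because Sect. 4.2 reports it in place of batch normalization).

```python
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Minimal 3D residual block sketch (illustrative configuration)."""

    def __init__(self, channels, groups=8):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm1 = nn.GroupNorm(groups, channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(groups, channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.norm1(self.conv1(x)))
        out = self.norm2(self.conv2(out))
        return self.act(out + x)  # identity shortcut
```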

3 Methods

3.1 Motion Estimation Branch

Motion estimation algorithms aim to determine the voxel-wise displacement between two sequential images. Given a source image \(I_{source}\) and a target image \(I_{target}\), motion estimation algorithms can be described by their formulation of the mapping function F such that \(F(I_{source},I_{target}) \rightarrow U_{x,y,z}\), where \(U_{x,y,z}\) is the displacement along the x-y-z directions. Supervised deep learning formulations of F seek to directly learn the regression between image pairs and ground truth displacement fields. However, due to the scarcity of ground truth cardiac motion, the motion branch of our network is designed and trained in an unsupervised manner similar to the framework presented in [10], which utilizes a spatial transformation to maximize a similarity metric between a warped source image and a target image.

Our proposed motion branch consists of a 3D U-Net inspired architecture with a downsampling analysis path followed by an upsampling synthesis path [15, 16]. Skip connections bridge features learned in the analysis path with features learned in the synthesis path. Our model also uses residual blocks to improve model performance and training efficiency [21]. The downsampling analysis path serves as a feature encoder which shares its weights with the segmentation branch. The input to the motion branch is a pair of 3D images, \(I_{source}\) and \(I_{target}\). The branch outputs a displacement field \(U_{x,y,z}\) which describes the motion from \(I_{source}\) to \(I_{target}\). The displacement field is then used to morph \(I_{source}\) to match \(I_{target}\) as described in [10]. The objective of the network is to maximize the similarity between the morphed \(I_{source}\) and \(I_{target}\) by minimizing the mean squared error between each corresponding voxel p in the two frames. This can be described as follows:

$$\begin{aligned} \ I_{morphed} = {\mathcal {T}}(I_{source},U_{x,y,z}) \end{aligned}$$
(1)
$$\begin{aligned} \ \mathcal {L}_{sim} = \frac{1}{\mid \varOmega \mid } \sum _{p \in \varOmega } (I_{target}(p) - I_{morphed}(p))^{2} \end{aligned}$$
(2)

where \(\varOmega \subset \mathbb {R}^{3}\) is the image domain, \(\mid \varOmega \mid \) is its number of voxels, and \({\mathcal {T}}\) is the spatial transformation operator that morphs \(I_{source}\) using \(U_{x,y,z}\).
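
For concreteness, the following is a minimal PyTorch sketch of the spatial transformation \({\mathcal {T}}\) (Eq. 1) and the similarity loss (Eq. 2). It assumes the displacement field is expressed in voxel units and uses trilinear resampling via grid_sample; the actual implementation in [10] may differ in these details.

```python
import torch
import torch.nn.functional as F

def warp(source, disp):
    """Morph a source volume with a dense displacement field (Eq. 1).

    source: (B, 1, D, H, W) image tensor
    disp:   (B, 3, D, H, W) displacement in voxels, ordered (z, y, x)
    """
    B, _, D, H, W = source.shape
    # Identity sampling grid in voxel coordinates
    zz, yy, xx = torch.meshgrid(
        torch.arange(D), torch.arange(H), torch.arange(W), indexing="ij")
    identity = torch.stack((zz, yy, xx)).float().to(source.device)  # (3, D, H, W)
    coords = identity.unsqueeze(0) + disp                           # (B, 3, D, H, W)
    # Normalize voxel coordinates to [-1, 1] for grid_sample
    scale = torch.tensor([D - 1, H - 1, W - 1], dtype=torch.float32,
                         device=source.device).view(1, 3, 1, 1, 1)
    coords = 2.0 * coords / scale - 1.0
    # grid_sample expects (B, D, H, W, 3) with the last axis ordered (x, y, z)
    grid = coords.permute(0, 2, 3, 4, 1).flip(-1)
    return F.grid_sample(source, grid, align_corners=True)

def similarity_loss(target, morphed):
    """Mean squared error over all voxels (Eq. 2)."""
    return F.mse_loss(morphed, target)
```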

Anatomical Constraints. To encourage realistic cardiac motion patterns, we incorporate anatomical constraints. Cardiac motion fields should be generally smooth; that is, there should be no discontinuities or jumps within the motion field. To discourage such discontinuities, we penalize the \(L^{2}\)-norm of the spatial derivatives in a manner similar to [8, 10] as follows:

$$\begin{aligned} \ \mathcal {L}_{smooth} = \frac{1}{\mid \varOmega \mid } \sum _{p \in \varOmega } \Vert \nabla {U_{x,y,z}(p)}\Vert _{2}^{2} \end{aligned}$$
(3)
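
A sketch of this gradient penalty, approximating the spatial derivatives with forward finite differences (the paper does not specify its exact discretization):

```python
def smoothness_loss(disp):
    """L2 penalty on the spatial gradients of the displacement field (Eq. 3).

    disp: (B, 3, D, H, W) displacement field; derivatives are approximated
    with forward differences along each spatial axis.
    """
    dz = disp[:, :, 1:, :, :] - disp[:, :, :-1, :, :]
    dy = disp[:, :, :, 1:, :] - disp[:, :, :, :-1, :]
    dx = disp[:, :, :, :, 1:] - disp[:, :, :, :, :-1]
    return dz.pow(2).mean() + dy.pow(2).mean() + dx.pow(2).mean()
```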

Additionally, the LV myocardium is expected to preserve its general shape through time. To enforce this notion, a shape constraint is added which morphs the manual segmentation of \(I_{source}\) using \(U_{x,y,z}\) and compares it against the manual segmentation of \(I_{target}\) in a manner similar to [11] as follows:

$$\begin{aligned} \ \mathcal {L}_{shape} = (1 - \frac{2 \mid S_{target} \cap {\mathcal {T}}(S_{source},U_{x,y,z}) \mid }{\mid S_{target} \mid + \mid {\mathcal {T}}(S_{source},U_{x,y,z}) \mid }) \end{aligned}$$
(4)
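
This constraint can be sketched as a soft Dice loss on the warped mask, reusing the warp() sketch above; the small eps term for numerical stability is an assumption not stated in the paper. Note that trilinear resampling of a binary mask yields a soft (fractional) mask, which the soft Dice formulation accommodates.

```python
def shape_loss(seg_source, seg_target, disp, eps=1e-6):
    """Soft Dice between the warped source mask and the target mask (Eq. 4).

    seg_source, seg_target: (B, 1, D, H, W) binary masks
    disp: (B, 3, D, H, W) displacement field
    """
    warped = warp(seg_source.float(), disp)  # warp() from the earlier sketch
    inter = (warped * seg_target).sum()
    denom = warped.sum() + seg_target.sum()
    return 1.0 - 2.0 * inter / (denom + eps)
```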

where \(S_{source}\) and \(S_{target}\) are the manual segmentations of \(I_{source}\) and \(I_{target}\), respectively. We can then define the full motion loss as:

$$\begin{aligned} \ \mathcal {L}_{motion} = \lambda _{sim}\mathcal {L}_{sim} + \lambda _{smooth}\mathcal {L}_{smooth} + \lambda _{shape}\mathcal {L}_{shape} \end{aligned}$$
(5)

3.2 Segmentation Branch

The objective of segmentation is to assign labels to voxels in order to delineate objects of interest from background. The segmentation branch of our proposed model follows the same 3D U-Net architectural style as the motion branch [15]. A downsampling analysis path shares weights with the motion branch and then separates into a segmentation-specific upsampling synthesis path. The goal of this branch is to minimize a combined Dice and binary cross-entropy loss between predicted segmentations and manual segmentations as follows:

$$\begin{aligned} \ \mathcal {L}_{dice} = (1 - \frac{2 \mid S\cap M \mid }{\mid S \mid + \mid M \mid }) \end{aligned}$$
(6)
$$\begin{aligned} \ \mathcal {L}_{bce} = -(y_{i}\log (P_{i}) + (1-y_{i})\log (1-P_{i})) \end{aligned}$$
(7)
$$\begin{aligned} \ \mathcal {L}_{seg} = \lambda _{dice} \mathcal {L}_{dice} + \lambda _{bce} \mathcal {L}_{bce} \end{aligned}$$
(8)

where M is the model prediction of the LV mask, S is the manually traced ground truth mask, \(y_{i}\) is the ground truth binary label of voxel i, and \(P_{i}\) is the predicted probability that voxel i is part of the LV mask.
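
A minimal sketch of this combined objective (Eqs. 6-8); the weighting values and the eps stabilizer are illustrative placeholders, not the empirically selected weights reported in Sect. 4.2.

```python
import torch
import torch.nn.functional as F

def seg_loss(logits, mask, lam_dice=1.0, lam_bce=1.0, eps=1e-6):
    """Combined soft Dice + binary cross-entropy loss (Eqs. 6-8).

    logits: (B, 1, D, H, W) raw network outputs
    mask:   (B, 1, D, H, W) manual segmentation in {0, 1}
    """
    probs = torch.sigmoid(logits)
    inter = (probs * mask).sum()
    dice = 1.0 - 2.0 * inter / (probs.sum() + mask.sum() + eps)
    bce = F.binary_cross_entropy_with_logits(logits, mask.float())
    return lam_dice * dice + lam_bce * bce
```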

3.3 Shared Feature Encoder

Multi-task learning is a popular method for combining closely related tasks in a unified framework. In this work, we adopt a parameter sharing approach [18]. Inspired by the success of [2] on 2D MR images, we employ a similar Siamese-style model (using a 3D U-Net style architecture with residual blocks) for feature encoding by sharing the weights between the downsampling analysis paths of the motion estimation and segmentation branches. The features produced by this shared encoder are then concatenated into each task-specific upsampling synthesis path (feature decoding), thereby allowing features learned from both tasks to influence the final output. In this way, the model allows each branch to exploit the complementary latent representations for each task during training. Both branches are trained simultaneously and optimized using a composite loss function, weighted by \(\alpha \) and \(\beta \):

$$\begin{aligned} \ \mathcal {L}_{total} = \alpha \mathcal {L}_{motion} + \beta \mathcal {L}_{seg} \end{aligned}$$
(9)
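A schematic sketch of this weight-sharing arrangement is given below; the module boundaries and the choice to segment the source frame are illustrative assumptions, and the encoder is assumed to return a list of multi-scale feature maps for the skip connections.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared feature encoder with task-specific decoders (schematic)."""

    def __init__(self, encoder, motion_decoder, seg_decoder):
        super().__init__()
        self.encoder = encoder                # shared analysis path
        self.motion_decoder = motion_decoder  # -> (B, 3, D, H, W) displacement
        self.seg_decoder = seg_decoder        # -> (B, 1, D, H, W) logits

    def forward(self, source, target):
        # Siamese-style encoding: both frames pass through the same weights
        feats_src = self.encoder(source)
        feats_tgt = self.encoder(target)
        # Fuse per-scale features from both frames for the motion branch
        fused = [torch.cat(pair, dim=1) for pair in zip(feats_src, feats_tgt)]
        disp = self.motion_decoder(fused)
        # Segment the source frame (one plausible choice for this sketch)
        logits = self.seg_decoder(feats_src)
        return disp, logits
```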
Fig. 2. Motion estimations from end-diastole to end-systole for healthy baseline canines. From left to right for both (a) and (b): optical flow, motion only, proposed model.

4 Experiments and Results

4.1 Datasets and Evaluation

In vivo studies were conducted on 8 anesthetized open-chested canines, each imaged under 5 conditions: healthy baseline, mild LAD stenosis, mild LAD stenosis with low-dose dobutamine (5 \(\upmu \)g/kg/min), moderate LAD stenosis, and moderate LAD stenosis with low-dose dobutamine [19]. Images were captured using a Philips iE33 scanner with an X7-2 probe. In total, we had 40 3D echocardiographic sequences, which we sampled into image pairs as input to the network. Image pairs were sampled in a first-frame-to-all manner: for each sequence, we used the first time frame (which roughly corresponds to end-diastole) as \(I_{source}\) and each subsequent time frame as \(I_{target}\), as sketched below. All experiments conducted in support of this work were approved under Institutional Animal Care and Use Committee policies.
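
A minimal sketch of this pairing scheme, treating a sequence as an ordered list of 3D volumes:

```python
def sample_pairs(sequence):
    """Pair frame 0 (approx. end-diastole) with every subsequent frame.

    sequence: ordered list of 3D volumes from one 3DE acquisition
    returns:  list of (I_source, I_target) tuples
    """
    source = sequence[0]
    return [(source, sequence[t]) for t in range(1, len(sequence))]
```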

Due to the scarcity of true cardiac motion, quantitative evaluation of motion performance is often done by comparing propagated labels [25]. In this work, we employ a similar evaluation strategy by warping the endocardial (endo) and epicardial (epi) contours of the source mask and evaluating the mean contour distance (mcd) from the expected (manually traced) target mask contours. We compare our model against a conventional motion estimation approach (denoted as Optical flow, as formulated in [20]) as well as a state-of-the-art deep learning based model (denoted as Motion only, which resembles the VoxelMorph framework described in [10]). Results displayed in Fig. 2 and Table 1a suggest that the proposed model performs favorably against the alternative methods. A Wilcoxon rank sum test indicates a significant improvement in performance \((p < 0.05)\) for the proposed model [24].
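
For reference, a symmetric mean contour distance between two contour point clouds could be computed as below using SciPy; this is an assumed implementation, since the paper does not specify whether a one-sided or symmetric variant was used.

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_contour_distance(pred_pts, true_pts):
    """Symmetric mean distance (in mm) between (N, 3) contour point clouds."""
    d_pred = cKDTree(true_pts).query(pred_pts)[0]  # pred -> nearest true point
    d_true = cKDTree(pred_pts).query(true_pts)[0]  # true -> nearest pred point
    return 0.5 * (d_pred.mean() + d_true.mean())
```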

To evaluate the segmentation predictions, we compare the Jaccard index and Hausdorff distance (HD) between model predictions and manually traced segmentations [22]. We evaluate the performance of the proposed model against a segmentation-specific branch without feature sharing (denoted as Segmentation only, which resembles the 3D U-Net architecture described in [15]). Additionally, since both the segmentation and motion branches of the proposed model produce segmentation predictions, either as their main task or in support of the shape constraint, we report the average values from these predictions. Results displayed in Fig. 3 and Table 1b suggest that the proposed model performs favorably in predicting LV myocardium segmentation. A Wilcoxon rank sum test indicates a significant improvement in performance \((p < 0.05)\) for the proposed model on both Jaccard and HD over the segmentation only model [24].
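
These two metrics could be sketched as follows (an assumed implementation; pred/true masks are binary volumes, and the Hausdorff distance is computed between surface point sets):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def jaccard_index(pred, true):
    """Jaccard index (intersection over union) between binary masks."""
    pred, true = pred.astype(bool), true.astype(bool)
    return np.logical_and(pred, true).sum() / np.logical_or(pred, true).sum()

def hausdorff_distance(pred_pts, true_pts):
    """Symmetric Hausdorff distance between (N, 3) surface point sets."""
    return max(directed_hausdorff(pred_pts, true_pts)[0],
               directed_hausdorff(true_pts, pred_pts)[0])
```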

4.2 Implementation Details

Of the 8 canine studies, we set aside 1 entire study, consisting of all 5 conditions, for testing. Of the 7 remaining studies, we randomly divided the image pairs such that 90% were used for training and 10% were used for validation and parameter searching. The acquired images were resampled from their native ultrasound resolutions so that each voxel corresponded to 1 mm\(^{3}\). During training and testing, the images were further resized to \(64^{3}\) due to computational limitations. Images were resized back to 1 mm\(^{3}\) resolution prior to evaluation. An Adam optimizer with a learning rate of \(1 \times 10^{-4}\) was used. The model was trained with a batch size of 1 for 50 epochs. Due to the small batch size, group normalization was used in place of standard batch normalization [23]. Hyperparameters and loss weights were empirically selected. The model was developed using PyTorch and trained on an NVIDIA GeForce RTX 2080 Ti. Pre- and post-processing were done using MATLAB 2019b.
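
Putting the earlier sketches together, a training loop consistent with these details might look as follows; the loss weights are placeholders rather than the empirically selected values, and the encoder/decoder construction (including group normalization inside the blocks) is left schematic.

```python
import torch

# Placeholder loss weights; the paper selected these empirically.
alpha, beta = 1.0, 1.0
lam_sim, lam_smooth, lam_shape = 1.0, 0.1, 0.5

model = MultiTaskNet(encoder, motion_decoder, seg_decoder).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(50):
    for source, target, seg_src, seg_tgt in train_loader:  # batch size 1
        disp, logits = model(source, target)
        morphed = warp(source, disp)                        # Eq. (1)
        l_motion = (lam_sim * similarity_loss(target, morphed)
                    + lam_smooth * smoothness_loss(disp)
                    + lam_shape * shape_loss(seg_src, seg_tgt, disp))
        l_seg = seg_loss(logits, seg_src)  # supervise on the source-frame mask
        loss = alpha * l_motion + beta * l_seg              # Eq. (9)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```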

Fig. 3. Predicted left ventricular masks. From left to right for both (a) and (b): segmentation only, proposed model.

Table 1. Quantitative evaluation. (a) Lower mcd indicates better performance. (b) Higher Jaccard (up to 1) and lower HD indicate better performance.

5 Conclusion

In this paper, we proposed a novel multi-task learning architecture that can simultaneously estimate 3D motion and predict volumetric LV myocardial segmentations in 3D echocardiography. This is accomplished through a weight-sharing feature encoder that is capable of learning latent representations in the data that are mutually beneficial to both tasks. Anatomical constraints are incorporated during training to encourage realistic cardiac motion patterns. Evaluations on an in vivo canine dataset suggest that our model performs favorably compared to single-task learning and other alternative methods. Future work includes further evaluation, such as cross-validation on our existing dataset and validation against a larger or different dataset. Furthermore, we will explore potential clinical applications of our model in estimating cardiac strain and in detecting and localizing myocardial ischemia.