1 Introduction

Reductions in myocardial blood flow due to coronary artery disease (CAD) can result in myocardial ischemia or infarction and subsequent regional myocardial dysfunction. Echocardiography provides a non-invasive and cost-efficient tool for clinicians to visually analyze left ventricular (LV) wall motion and assess regional dysfunction in the myocardium. However, such qualitative analysis is subjective by nature and, as a result, prone to high intra- and interobserver variability.

Motion estimation algorithms provide an objective method for characterizing myocardial contractile function, while segmentation algorithms assist in localizing regions of ischemic tissue. Traditional approaches treat these tasks as unique steps and solve them separately, but recent efforts in the medical image analysis and computer vision fields suggest that these two tasks may be mutually beneficial when optimized simultaneously [1,2,3].

In this paper, we propose a multi-task deep learning network to simultaneously segment the LV and estimate its motion between time frames in a 3D echocardiographic (3DE) sequence. The main contributions of this work are as follows: 1) We introduce a novel multi-task learning architecture with residual blocks that solves both 3D motion estimation and volumetric segmentation using a weight-sharing feature encoder with task-specific decoding branches; 2) We incorporate anatomically inspired constraints to encourage realistic cardiac motion estimation; 3) We apply our proposed model to 3DE sequences, which are typically more challenging than Magnetic Resonance (MR) and Computed Tomography (CT) images due to their lower signal-to-noise ratio, and more challenging than 2DE due to their higher dimensionality and lower spatial and temporal resolution.

2 Related Works

Classic model-based segmentation methods such as active shape and appearance models usually require a priori knowledge or large amounts of feature engineering to achieve adequate results [4, 5]. In recent years, data-driven deep learning approaches have shown promising results, but they still face challenges due to inherent ultrasound image properties such as low signal-to-noise ratio and low image contrast [4, 6]. This becomes increasingly problematic because cardiac motion estimation approaches often rely on accurate segmentations to act as anatomical guides for surface and shape tracking or for the placement of deformable grid points [7, 8]; errors in the segmentation predictions propagate to the motion estimates. While it is conceivable that an expert clinician could manually segment or adjust algorithm predictions prior to the motion estimation step, this is a tedious workaround that is infeasible in practice. Several deep learning approaches have successfully estimated motion in the computer vision field, but the difficulty of obtaining true LV motion in clinical data makes supervised approaches challenging. Unsupervised approaches that seek to maximize intensity similarity between image pairs have been successful in MR and CT; however, applications in 3DE remain limited [9,10,11,12].

In recent years, efforts have been made to combine the tasks of motion estimation and segmentation. Qin et al. [2] proposed a Siamese-style joint encoder network using a VGG-16 based architecture that demonstrates promising results when applied to 2D MR cardiac images. The work in [13] adapts this idea to 2D echocardiography by adopting a feature bridging framework [1] with anatomically inspired constraints. This is further expanded to 3DE in [3] through an iterative training approach in which results from one task influence the training of the other [14]. In this work, we propose an alternative novel framework for combining motion estimation and segmentation in 3DE that uses a shared feature encoder to exploit complementary latent representations in the data, which the iterative style of [3] is not capable of doing. In addition to estimating 3D motion and predicting volumetric LV segmentations, our model further differs from [2] in its use of a 3D U-Net-style architecture [15] with residual blocks [21] for the encoding and decoding branches, as opposed to VGG-16 and FCN architectures (Fig. 1).

Fig. 1. The proposed network and its components. (A) Motion estimation and segmentation tasks are coupled in a multi-task learning framework. (B) An overview of the residual block.
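
As a point of reference for the residual block in Fig. 1(B), below is a minimal PyTorch sketch; the channel count, normalization placement, and activation choices are illustrative assumptions rather than the paper's exact configuration (group normalization is used here because Sect. 4.2 reports it in place of batch normalization).

```python
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Minimal 3D residual block sketch (illustrative configuration)."""

    def __init__(self, channels, groups=8):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm1 = nn.GroupNorm(groups, channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(groups, channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.norm1(self.conv1(x)))
        out = self.norm2(self.conv2(out))
        return self.act(out + x)  # identity shortcut
```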

3 Methods

3.1 Motion Estimation Branch

Motion estimation algorithms aim to determine the voxel-wise displacement between two sequential images. Given a source image \(I_{source}\) and a target image \(I_{target}\), motion estimation algorithms can be described by their formulation of the mapping function F such that \(F(I_{source},I_{target}) \rightarrow U_{x,y,z}\), where \(U_{x,y,z}\) is the displacement along the x-y-z directions. Supervised deep learning formulations of F seek to directly learn the regression between image pairs and ground truth displacement fields. However, due to the scarcity of ground truth cardiac motion, the motion branch of our network is designed and trained in an unsupervised manner similar to the framework presented in [10], which utilizes a spatial transformation to maximize a similarity metric between a warped source image and a target image.

Our proposed motion branch consists of a 3D U-Net inspired architecture with a downsampling analysis path followed by an upsampling synthesis path [15, 16]. Skip connections bridge features learned in the analysis path with features learned in the synthesis path. Our model also uses residual blocks to improve model performance and training efficiency [21]. The downsampling analysis path serves as a feature encoder which shares its weights with the segmentation branch. The input to the motion branch is a pair of 3D images, \(I_{source}\) and \(I_{target}\). The branch outputs a displacement field \(U_{x,y,z}\) which describes the motion from \(I_{source}\) to \(I_{target}\). The displacement field is then used to morph \(I_{source}\) to match \(I_{target}\) as described in [10]. The objective of the network is to maximize the similarity between the morphed \(I_{source}\) and \(I_{target}\) by minimizing the mean squared error between each corresponding voxel p in the two frames. This can be described as follows:

$$\begin{aligned} \ I_{morphed} = {\mathcal {T}}(I_{source},U_{x,y,z}) \end{aligned}$$
(1)
$$\begin{aligned} \ \mathcal {L}_{sim} = \frac{1}{\mid \varOmega \mid } \sum _{p \in \varOmega } (I_{target}(p) - I_{morphed}(p))^{2} \end{aligned}$$
(2)

where \(\varOmega \subset \mathbb {R}^{3}\) is the image domain, \(\mid \varOmega \mid \) is its number of voxels, and \({\mathcal {T}}\) is the spatial transformation operator that morphs \(I_{source}\) using \(U_{x,y,z}\).
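
For concreteness, the following is a minimal PyTorch sketch of the spatial transformation \({\mathcal {T}}\) (Eq. 1) and the similarity loss (Eq. 2). It assumes the displacement field is expressed in voxel units and uses trilinear resampling via grid_sample; the actual implementation in [10] may differ in these details.

```python
import torch
import torch.nn.functional as F

def warp(source, disp):
    """Morph a source volume with a dense displacement field (Eq. 1).

    source: (B, 1, D, H, W) image tensor
    disp:   (B, 3, D, H, W) displacement in voxels, ordered (z, y, x)
    """
    B, _, D, H, W = source.shape
    # Identity sampling grid in voxel coordinates
    zz, yy, xx = torch.meshgrid(
        torch.arange(D), torch.arange(H), torch.arange(W), indexing="ij")
    identity = torch.stack((zz, yy, xx)).float().to(source.device)  # (3, D, H, W)
    coords = identity.unsqueeze(0) + disp                           # (B, 3, D, H, W)
    # Normalize voxel coordinates to [-1, 1] for grid_sample
    scale = torch.tensor([D - 1, H - 1, W - 1], dtype=torch.float32,
                         device=source.device).view(1, 3, 1, 1, 1)
    coords = 2.0 * coords / scale - 1.0
    # grid_sample expects (B, D, H, W, 3) with the last axis ordered (x, y, z)
    grid = coords.permute(0, 2, 3, 4, 1).flip(-1)
    return F.grid_sample(source, grid, align_corners=True)

def similarity_loss(target, morphed):
    """Mean squared error over all voxels (Eq. 2)."""
    return F.mse_loss(morphed, target)
```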

Anatomical Constraints. To encourage realistic cardiac motion patterns, we incorporate anatomical constraints. Cardiac motion fields should be generally smooth; that is, there should be no discontinuities or jumps within the motion field. To discourage such discontinuities, we penalize the \(L^{2}\)-norm of the spatial derivatives in a manner similar to [8, 10] as follows:

$$\begin{aligned} \ \mathcal {L}_{smooth} = \frac{1}{\mid \varOmega \mid } \sum _{p \in \varOmega } \Vert \nabla {U_{x,y,z}(p)}\Vert _{2}^{2} \end{aligned}$$
(3)
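
A sketch of this gradient penalty, approximating the spatial derivatives with forward finite differences (the paper does not specify its exact discretization):

```python
def smoothness_loss(disp):
    """L2 penalty on the spatial gradients of the displacement field (Eq. 3).

    disp: (B, 3, D, H, W) displacement field; derivatives are approximated
    with forward differences along each spatial axis.
    """
    dz = disp[:, :, 1:, :, :] - disp[:, :, :-1, :, :]
    dy = disp[:, :, :, 1:, :] - disp[:, :, :, :-1, :]
    dx = disp[:, :, :, :, 1:] - disp[:, :, :, :, :-1]
    return dz.pow(2).mean() + dy.pow(2).mean() + dx.pow(2).mean()
```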

Additionally, the LV myocardium is expected to preserve its general shape through time. To enforce this notion, a shape constraint is added which morphs the manual segmentation of \(I_{source}\) using \(U_{x,y,z}\) and compares it against the manual segmentation of \(I_{target}\) in a manner similar to [11] as follows:

$$\begin{aligned} \ \mathcal {L}_{shape} = (1 - \frac{2 \mid S_{target} \cap {\mathcal {T}}(S_{source},U_{x,y,z}) \mid }{\mid S_{target} \mid + \mid {\mathcal {T}}(S_{source},U_{x,y,z}) \mid }) \end{aligned}$$
(4)
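
This constraint can be sketched as a soft Dice loss on the warped mask, reusing the warp() sketch above; the small eps term for numerical stability is an assumption not stated in the paper. Note that trilinear resampling of a binary mask yields a soft (fractional) mask, which the soft Dice formulation accommodates.

```python
def shape_loss(seg_source, seg_target, disp, eps=1e-6):
    """Soft Dice between the warped source mask and the target mask (Eq. 4).

    seg_source, seg_target: (B, 1, D, H, W) binary masks
    disp: (B, 3, D, H, W) displacement field
    """
    warped = warp(seg_source.float(), disp)  # warp() from the earlier sketch
    inter = (warped * seg_target).sum()
    denom = warped.sum() + seg_target.sum()
    return 1.0 - 2.0 * inter / (denom + eps)
```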

where \(S_{source}\) and \(S_{target}\) are the manual segmentations of \(I_{source}\) and \(I_{target}\), respectively. We can then define the full motion loss as:

$$\begin{aligned} \ \mathcal {L}_{motion} = \lambda _{sim}\mathcal {L}_{sim} + \lambda _{smooth}\mathcal {L}_{smooth} + \lambda _{shape}\mathcal {L}_{shape} \end{aligned}$$
(5)

3.2 Segmentation Branch

The objective of segmentation is to assign labels to voxels in order to delineate objects of interest from background. The segmentation branch of our proposed model follows the same 3D U-Net architectural style as the motion branch [15]. A downsampling analysis path shares weights with the motion branch and then separates into a segmentation-specific upsampling synthesis path. The goal of this branch is to minimize a combined Dice and binary cross-entropy loss between predicted segmentations and manual segmentations as follows:

$$\begin{aligned} \ \mathcal {L}_{dice} = (1 - \frac{2 \mid S\cap M \mid }{\mid S \mid + \mid M \mid }) \end{aligned}$$
(6)
$$\begin{aligned} \ \mathcal {L}_{bce} = -(y_{i}\log (P_{i}) + (1-y_{i})\log (1-P_{i})) \end{aligned}$$
(7)
$$\begin{aligned} \ \mathcal {L}_{seg} = \lambda _{dice} \mathcal {L}_{dice} + \lambda _{bce} \mathcal {L}_{bce} \end{aligned}$$
(8)

where M is the model prediction of the LV mask, S is the manually traced ground truth mask, \(y_{i}\) is the ground truth binary label of voxel i, and \(P_{i}\) is the predicted probability that voxel i is part of the LV mask.
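
A minimal sketch of this combined objective (Eqs. 6-8); the weighting values and the eps stabilizer are illustrative placeholders, not the empirically selected weights reported in Sect. 4.2.

```python
import torch
import torch.nn.functional as F

def seg_loss(logits, mask, lam_dice=1.0, lam_bce=1.0, eps=1e-6):
    """Combined soft Dice + binary cross-entropy loss (Eqs. 6-8).

    logits: (B, 1, D, H, W) raw network outputs
    mask:   (B, 1, D, H, W) manual segmentation in {0, 1}
    """
    probs = torch.sigmoid(logits)
    inter = (probs * mask).sum()
    dice = 1.0 - 2.0 * inter / (probs.sum() + mask.sum() + eps)
    bce = F.binary_cross_entropy_with_logits(logits, mask.float())
    return lam_dice * dice + lam_bce * bce
```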

3.3 Shared Feature Encoder

Multi-task learning is a popular method for combining closely related tasks in a unified framework. In this work, we adopt a parameter sharing approach [18]. Inspired by the success of [2] on 2D MR images, we employ a similar Siamese-style model (using a 3D U-Net style architecture with residual blocks) for feature encoding by sharing the weights between the downsampling analysis paths of the motion estimation and segmentation branches. The features produced by this shared encoder are then concatenated into each task-specific upsampling synthesis path (feature decoding), thereby allowing features learned from both tasks to influence the final output. In this way, the model allows each branch to exploit the complementary latent representations for each task during training. Both branches are trained simultaneously and optimized using a composite loss function, weighted by \(\alpha \) and \(\beta \):

$$\begin{aligned} \ \mathcal {L}_{total} = \alpha \mathcal {L}_{motion} + \beta \mathcal {L}_{seg} \end{aligned}$$
(9)
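A schematic sketch of this weight-sharing arrangement is given below; the module boundaries and the choice to segment the source frame are illustrative assumptions, and the encoder is assumed to return a list of multi-scale feature maps for the skip connections.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared feature encoder with task-specific decoders (schematic)."""

    def __init__(self, encoder, motion_decoder, seg_decoder):
        super().__init__()
        self.encoder = encoder                # shared analysis path
        self.motion_decoder = motion_decoder  # -> (B, 3, D, H, W) displacement
        self.seg_decoder = seg_decoder        # -> (B, 1, D, H, W) logits

    def forward(self, source, target):
        # Siamese-style encoding: both frames pass through the same weights
        feats_src = self.encoder(source)
        feats_tgt = self.encoder(target)
        # Fuse per-scale features from both frames for the motion branch
        fused = [torch.cat(pair, dim=1) for pair in zip(feats_src, feats_tgt)]
        disp = self.motion_decoder(fused)
        # Segment the source frame (one plausible choice for this sketch)
        logits = self.seg_decoder(feats_src)
        return disp, logits
```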
Fig. 2. Motion estimations from end-diastole to end-systole for healthy baseline canines. From left to right for both (a) and (b): optical flow, motion only, proposed model.

4 Experiments and Results

4.1 Datasets and Evaluation

In vivo studies were conducted on 8 anesthetized open-chested canines, each imaged under 5 conditions: healthy baseline, mild LAD stenosis, mild LAD stenosis with low-dose dobutamine (5 \(\upmu \)g/kg/min), moderate LAD stenosis, and moderate LAD stenosis with low-dose dobutamine [19]. Images were captured using a Philips iE33 scanner with an X7-2 probe. In total, we had 40 3D echocardiographic sequences, which we sampled into image pairs as input to the network. Image pairs were sampled in a first-frame-to-all manner: for each sequence, we used the first time frame (which roughly corresponds to end-diastole) as \(I_{source}\) and each subsequent time frame as \(I_{target}\), as sketched below. All experiments conducted in support of this work were approved under Institutional Animal Care and Use Committee policies.
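
A minimal sketch of this pairing scheme, treating a sequence as an ordered list of 3D volumes:

```python
def sample_pairs(sequence):
    """Pair frame 0 (approx. end-diastole) with every subsequent frame.

    sequence: ordered list of 3D volumes from one 3DE acquisition
    returns:  list of (I_source, I_target) tuples
    """
    source = sequence[0]
    return [(source, sequence[t]) for t in range(1, len(sequence))]
```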

Due to the scarcity of true cardiac motion, quantitative evaluation of motion performance is often done by comparing propagated labels [25]. In this work, we employ a similar evaluation strategy by warping the endocardial (endo) and epicardial (epi) contours of the source mask and evaluating the mean contour distance (mcd) from the expected (manually traced) target mask contours. We compare our model against a conventional motion estimation approach (denoted as Optical flow, as formulated in [20]) as well as a state-of-the-art deep learning based model (denoted as Motion only, which resembles the VoxelMorph framework described in [10]). Results displayed in Fig. 2 and Table 1a suggest that the proposed model performs favorably against the alternative methods. A Wilcoxon rank sum test indicates a significant improvement in performance \((p < 0.05)\) for the proposed model [24].
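
For reference, a symmetric mean contour distance between two contour point clouds could be computed as below using SciPy; this is an assumed implementation, since the paper does not specify whether a one-sided or symmetric variant was used.

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_contour_distance(pred_pts, true_pts):
    """Symmetric mean distance (in mm) between (N, 3) contour point clouds."""
    d_pred = cKDTree(true_pts).query(pred_pts)[0]  # pred -> nearest true point
    d_true = cKDTree(pred_pts).query(true_pts)[0]  # true -> nearest pred point
    return 0.5 * (d_pred.mean() + d_true.mean())
```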

To evaluate the segmentation predictions, we compare the Jaccard index and Hausdorff distance (HD) between model predictions and manually traced segmentations [22]. We evaluate the performance of the proposed model against a segmentation-specific branch without feature sharing (denoted as Segmentation only, which resembles the 3D U-Net architecture described in [15]). Additionally, since both the segmentation and motion branches of the proposed model produce segmentation predictions, either as their main task or in support of the shape constraint, we report the average values from these predictions. Results displayed in Fig. 3 and Table 1b suggest that the proposed model performs favorably in predicting LV myocardium segmentation. A Wilcoxon rank sum test indicates a significant improvement in performance \((p < 0.05)\) for the proposed model on both Jaccard and HD over the segmentation only model [24].
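
These two metrics could be sketched as follows (an assumed implementation; pred/true masks are binary volumes, and the Hausdorff distance is computed between surface point sets):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def jaccard_index(pred, true):
    """Jaccard index (intersection over union) between binary masks."""
    pred, true = pred.astype(bool), true.astype(bool)
    return np.logical_and(pred, true).sum() / np.logical_or(pred, true).sum()

def hausdorff_distance(pred_pts, true_pts):
    """Symmetric Hausdorff distance between (N, 3) surface point sets."""
    return max(directed_hausdorff(pred_pts, true_pts)[0],
               directed_hausdorff(true_pts, pred_pts)[0])
```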

4.2 Implementation Details

Of the 8 canine studies, we set aside 1 entire study, consisting of all 5 conditions, for testing. Of the 7 remaining studies, we randomly divided the image pairs such that 90% were used for training and 10% were used for validation and parameter searching. The acquired images were resampled from their native ultrasound resolutions so that each voxel corresponded to 1 mm\(^{3}\). During training and testing, the images were further resized to \(64^{3}\) due to computational limitations. Images were resized back to 1 mm\(^{3}\) resolution prior to evaluation. An Adam optimizer with a learning rate of \(1 \times 10^{-4}\) was used. The model was trained with a batch size of 1 for 50 epochs. Due to the small batch size, group normalization was used in place of standard batch normalization [23]. Hyperparameters and loss weights were empirically selected. The model was developed using PyTorch and trained on an NVIDIA GeForce RTX 2080 Ti. Pre- and post-processing were done using MATLAB 2019b.
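
Putting the earlier sketches together, a training loop consistent with these details might look as follows; the loss weights are placeholders rather than the empirically selected values, and the encoder/decoder construction (including group normalization inside the blocks) is left schematic.

```python
import torch

# Placeholder loss weights; the paper selected these empirically.
alpha, beta = 1.0, 1.0
lam_sim, lam_smooth, lam_shape = 1.0, 0.1, 0.5

model = MultiTaskNet(encoder, motion_decoder, seg_decoder).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(50):
    for source, target, seg_src, seg_tgt in train_loader:  # batch size 1
        disp, logits = model(source, target)
        morphed = warp(source, disp)                        # Eq. (1)
        l_motion = (lam_sim * similarity_loss(target, morphed)
                    + lam_smooth * smoothness_loss(disp)
                    + lam_shape * shape_loss(seg_src, seg_tgt, disp))
        l_seg = seg_loss(logits, seg_src)  # supervise on the source-frame mask
        loss = alpha * l_motion + beta * l_seg              # Eq. (9)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```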

Fig. 3. Predicted left ventricular masks. From left to right for both (a) and (b): segmentation only, proposed model.

Table 1. Quantitative evaluation. (a) Lower mcd indicates better performance. (b) Higher Jaccard (up to 1) and lower HD indicate better performance.

5 Conclusion

In this paper, we proposed a novel multi-task learning architecture that can simultaneously estimate 3D motion and predict volumetric LV myocardial segmentations in 3D echocardiography. This is accomplished through a weight-sharing feature encoder that is capable of learning latent representations in the data that are mutually beneficial to both tasks. Anatomical constraints are incorporated during training to encourage realistic cardiac motion patterns. Evaluations on an in vivo canine dataset suggest that our model performs favorably compared to single-task learning and other alternative methods. Future work includes further evaluation, such as cross-validation on our existing dataset and validation against a larger or different dataset. Furthermore, we will explore potential clinical applications of our model in estimating cardiac strain and in detecting and localizing myocardial ischemia.