
1 Introduction

The analysis of 2D transthoracic echocardiograms is crucial in clinical cardiology for disease diagnosis and treatment selection [2]. The analysis comprises the extraction of a number of quantitative markers of cardiac function, such as the ejection fraction (EF) and the chamber volumes [10]. Extraction of these quantitative markers requires accurate and precise delineation of the cardiac anatomy. However, manual expert annotation is a time-consuming task associated with high inter- and intra-rater variability [1]. Existing commercial solutions allow semi- or fully-automatic delineation of the cardiac structures, but they are typically limited to the segmentation of the end-diastolic (ED) and end-systolic (ES) frames [14].

The focus on ED and ES frames is also reflected in most published research utilizing machine learning approaches [12]. As these methods require large and diverse datasets for training, collecting annotations of full sequences has not been the prime focus. The most commonly used public datasets for echocardiography segmentation, CAMUS [8] and EchoNet-Dynamic [11], provide manual labels for the ED and ES frames only. Therefore, most current state-of-the-art (SoTA) segmentation methods rely solely on expert annotations for these two frames [12]. Despite achieving performance within the margins of intra-observer variability [15, 18], these methods do not address the smooth evolution of the cardiac structures over time, leading to temporally inconsistent predictions [12].

Since preserving the temporal consistency of the segmentations is beneficial for precise EF estimation [18], several studies have addressed this issue. Some approaches combine temporal and multi-view information using 3D CNNs and convolutional LSTMs [9]. Others enforce temporal smoothness through post-processing [12] or leverage optical flow to improve segmentation accuracy [3, 21]. Wei et al. introduced CLAS, an end-to-end approach that combines co-learning of appearance and shape features with the generation of left ventricle (LV) pseudo-labels for the intermediate time points [18]. These LV pseudo-labels are obtained by warping the ground truth maps to other frames using optical flow. Chen et al. further added data augmentation (A-CLAS) [4], while Wei et al. introduced two auxiliary tasks, view classification and EF regression, and proposed a multi-task version of CLAS (MCLAS) [19].

Although these methods achieve temporally consistent segmentation, their reliance on co-learning and pseudo-labels makes them computationally complex. Moreover, their constrained end-to-end nature restricts their modularity. In contrast, we present a method that addresses pseudo-label generation and temporally smooth segmentation as separate components. It leverages an unsupervised image registration model to sequentially estimate the deformations between frames and generate pseudo-labels through the warping of the available segmentation maps. The generated pseudo-labels allow supervised training of arbitrary 3D (2D+time) segmentation networks. To this end, we train a 3D nnU-Net [7] to delineate the LV cavity, LV myocardium and left atrium. We evaluate the proposed approach on the public CAMUS dataset [8], demonstrating that it generates reliable pseudo-labels that bring significant benefits to the downstream segmentation task. The segmentation model exhibits remarkable accuracy in delineating cardiac structures while preserving spatiotemporal smoothness, ultimately yielding accurate EF estimations.

Fig. 1. The proposed image registration-based pseudo-label generation method. The provided segmentations are propagated from ED to ES (a) and from ES to ED (e). The masks from the two directions are aggregated as described in Sect. 2.1 and weighted according to a sinusoidal function (b and d).

2 Method

To obtain accurate and temporally consistent 3D (2D+time) segmentations from a sparsely labeled dataset, the method first generates the pseudo-labels for those frames that lack reference segmentations. This is done through the sequential application of image registration. Thereafter, the method uses these pseudo-labels to augment sparse reference annotations and train a segmentation model.

2.1 Pseudo-labels Generation

Echocardiography acquisition yields a sequence of image frames \(x_t, \; \forall t \in \{1, 2, \dots, N\}\) showing the evolution of the heart over the cardiac cycle. Given the reference segmentations for the ED and ES frames, unsupervised deformable image registration (DIR) is exploited to segment the frames lacking segmentation masks. The registration’s dense displacement vector field (DVF) is employed to warp the segmentation \(y_t\) of frame \(x_t\) to frame \(x_{t+1}\), resulting in a pseudo-segmentation \(\overrightarrow{y}_{t+1}\) of frame \(x_{t+1}\). Specifically, the available ED segmentation is iteratively forward-propagated through the sequence to produce \(\overrightarrow{y}_t, \; \forall t \in \{1, 2, \dots, N\}\). Likewise, backward-propagating the ES segmentation mask returns a set of \(\overleftarrow{y}_t, \; \forall t \in \{1, 2, \dots, N\}\).
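
As a concrete illustration, the sketch below outlines this sequential propagation in Python; the `register` and `warp` helpers are placeholders standing in for the DIR framework of [6] and its mask warping (nearest-neighbour), not part of the original method's code.

```python
def propagate_labels(frames, y_ed, y_es, register, warp):
    """Sequentially propagate the ED and ES reference masks through a sequence.

    frames : list of N images x_1..x_N (x_1 = ED frame, x_N = ES frame)
    y_ed, y_es : reference masks at ED (t=1) and ES (t=N)
    register(fixed, moving) -> dense displacement vector field (DVF)
    warp(mask, dvf) -> mask warped by the DVF (nearest-neighbour interpolation)
    """
    n = len(frames)

    # Forward pass: warp the ED mask from frame t to frame t+1.
    fwd = [y_ed]
    for t in range(n - 1):
        dvf = register(fixed=frames[t + 1], moving=frames[t])
        fwd.append(warp(fwd[-1], dvf))

    # Backward pass: warp the ES mask from frame t+1 down to frame t.
    bwd = [y_es]
    for t in range(n - 1, 0, -1):
        dvf = register(fixed=frames[t - 1], moving=frames[t])
        bwd.append(warp(bwd[-1], dvf))
    bwd.reverse()

    return fwd, bwd  # two candidate pseudo-label sets per frame
```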

To mitigate error accumulation caused by sequential registrations, the two sets of pseudo-labels \(\overrightarrow{\textbf{y}}\) and \(\overleftarrow{\textbf{y}}\) are combined using a weighted average of their class-wise signed distance maps. Specifically, for each class and time point, a binary mask is extracted and the signed distance to its edges is computed. The resulting distance maps, \(d(\overrightarrow{y}_{t,C})\) and \(d(\overleftarrow{y}_{t,C})\), are then weighted-averaged to return an image with negative values outside the object, positive values inside and zero crossings at the object boundaries. Thresholding this image at zero produces the final mask. The final bidirectional method is illustrated in Fig. 1 and defined mathematically in Eq. 1:

$$\begin{aligned} \tilde{y}_{t,C} = \left( d(\overrightarrow{y}_{t,C}) \cdot \cos ^2\!\left(\frac{\pi t}{2N}\right) + d(\overleftarrow{y}_{t,C}) \cdot \sin ^2\!\left(\frac{\pi t}{2N}\right) \right) > 0 \end{aligned}$$
(1)

where \(\tilde{y}_{t,C}\) is the resulting binary mask for class C at time point t, \(d(\cdot )\) is the signed distance transform and N is the ED-to-ES sequence length. The weights are determined according to the temporal proximity of \(d(\overrightarrow{y}_{t,C})\) and \(d(\overleftarrow{y}_{t,C})\) to the ED and ES reference segmentations, respectively. More specifically, they are designed to decrease from 1 to 0 in the direction of the propagation, giving more weight to the forward direction at the beginning of the sequence and to the backward direction at the end. This further mitigates error accumulation and improves the accuracy of the object representation.
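
A minimal sketch of the fusion in Eq. 1 is shown below, assuming SciPy's Euclidean distance transform for the signed distance maps; the helper names and the 1-indexed time variable are illustrative choices, not taken from the original implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(binary_mask):
    """Signed distance map: positive inside the object, negative outside."""
    inside = distance_transform_edt(binary_mask)
    outside = distance_transform_edt(~binary_mask)
    return inside - outside

def fuse_bidirectional(fwd_mask, bwd_mask, t, n):
    """Combine forward- and backward-propagated masks for one class at time t (Eq. 1).

    fwd_mask, bwd_mask : boolean masks for one class at time point t
    t : time index (1 = ED, n = ES); n : ED-to-ES sequence length
    """
    w_fwd = np.cos(np.pi * t / (2 * n)) ** 2   # close to 1 near ED, 0 at ES
    w_bwd = np.sin(np.pi * t / (2 * n)) ** 2   # close to 0 near ED, 1 at ES
    blended = w_fwd * signed_distance(fwd_mask) + w_bwd * signed_distance(bwd_mask)
    return blended > 0  # zero crossing of the blended map gives the fused boundary
```

Since the two weights sum to one, the blended map is a proper weighted average of the two signed distance maps, and thresholding it at zero recovers the fused object mask.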

In this work, an unsupervised deep learning registration framework is utilized to perform image alignment through CNNs [6]. The method is driven by the image similarity between fixed and moving image pairs, uses B-splines as the transformation model, and supports coarse-to-fine alignment. The loss function combines the negative normalized cross-correlation \(\mathcal {L}_{NCC}\) with the bending energy penalty P: \(\mathcal {L} = \mathcal {L}_{NCC} + \alpha P\) [13]. The regularization term P penalizes the second-order derivatives of the local transformation, thereby enforcing global smoothness and preventing anatomically implausible folding of the image.
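
For illustration, one possible PyTorch formulation of this loss is sketched below; the finite-difference bending energy operates on the dense DVF and only approximates the analytic B-spline penalty of [13], and all function names are ours rather than part of the framework in [6].

```python
import torch

def ncc_loss(fixed, moving_warped, eps=1e-8):
    """Negative normalized cross-correlation between image batches of shape (B, 1, H, W)."""
    f = fixed - fixed.mean(dim=(2, 3), keepdim=True)
    m = moving_warped - moving_warped.mean(dim=(2, 3), keepdim=True)
    ncc = (f * m).sum(dim=(2, 3)) / (
        torch.sqrt((f ** 2).sum(dim=(2, 3)) * (m ** 2).sum(dim=(2, 3))) + eps)
    return -ncc.mean()

def bending_energy(dvf):
    """Approximate bending energy of a dense DVF (B, 2, H, W) via second-order finite differences."""
    d2y = dvf[:, :, 2:, :] - 2 * dvf[:, :, 1:-1, :] + dvf[:, :, :-2, :]
    d2x = dvf[:, :, :, 2:] - 2 * dvf[:, :, :, 1:-1] + dvf[:, :, :, :-2]
    return (d2y ** 2).mean() + (d2x ** 2).mean()

def registration_loss(fixed, moving_warped, dvf, alpha=1.0):
    """Combined loss L = L_NCC + alpha * P."""
    return ncc_loss(fixed, moving_warped) + alpha * bending_energy(dvf)
```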

2.2 Segmentation

The reference segmentations of the echocardiograms are augmented with the pseudo-labels to provide densely labeled reference sequences. This enables the training of 3D (2D+time) segmentation models, which are designed to be trained on densely annotated data. By encoding the time dimension as the third dimension in convolutional space, a 3D model can learn spatiotemporal features that encourage temporally smooth predictions. To this end, a 3D nnU-Net is trained on the augmented dataset (3D Dense nnU-Net) [7].
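
To illustrate this encoding, the sketch below stacks a sequence and its reference/pseudo-label maps into 3D NIfTI volumes in a layout suitable for nnU-Net training; the file naming and directory structure follow common nnU-Net conventions but are assumptions that should be checked against the framework version in use.

```python
import os
import numpy as np
import nibabel as nib

def save_sequence_for_nnunet(frames, labels, case_id, out_dir, spacing=(1.0, 1.0, 1.0)):
    """Stack a 2D+time sequence and its labels into 3D NIfTI volumes.

    frames : list of 2D arrays (H, W), one per time point (ED .. ES)
    labels : list of 2D label maps (reference masks at ED/ES, pseudo-labels elsewhere)
    spacing: (row, column, frame) spacing encoded in the NIfTI affine
    """
    os.makedirs(f"{out_dir}/imagesTr", exist_ok=True)
    os.makedirs(f"{out_dir}/labelsTr", exist_ok=True)
    image = np.stack(frames, axis=-1).astype(np.float32)   # time becomes the third axis
    label = np.stack(labels, axis=-1).astype(np.uint8)
    affine = np.diag(list(spacing) + [1.0])
    nib.save(nib.Nifti1Image(image, affine), f"{out_dir}/imagesTr/{case_id}_0000.nii.gz")
    nib.save(nib.Nifti1Image(label, affine), f"{out_dir}/labelsTr/{case_id}.nii.gz")
```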

2.3 Evaluation

Both the generated pseudo-labels and the predicted segmentations are evaluated intrinsically using overlap and boundary metrics: the Dice coefficient (DC), the mean absolute surface distance (MAD) and the 2D Hausdorff distance (HD). The metrics are calculated per frame and subsequently averaged over an entire video. Additionally, the segmentation models are evaluated extrinsically through quantification of the EF and the LV volumes at end-diastole and end-systole (EDV and ESV). To aggregate dataset-level statistics for these indices, the correlation coefficient, bias and mean absolute error (MAE) are calculated between the reference and automatically obtained values. Finally, the temporal consistency of the automatic segmentation is assessed by tracking the area of a given class over time. The smoothness of a sequence is computed as the integral of the squared second derivative of the resulting area curve; squaring the second derivative prior to integration accounts for changes in the slope of the curve and prevents positive and negative bending from cancelling out. The final smoothness metric is defined in Eq. 2, with N being the ED-to-ES sequence length and \(a_C(t)\) the area of class C at time point t.

$$\begin{aligned} \text {Smoothness} = \int _{1}^{N} \left( a_C''(t)\right) ^2 dt, \end{aligned}$$
(2)
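
A discrete approximation of Eq. 2, assuming the area curve is sampled once per frame with unit time step, could look as follows (the function name is ours).

```python
import numpy as np

def temporal_smoothness(area_curve):
    """Discrete version of Eq. 2 for a single class.

    area_curve : 1D array with the class area per frame, a_C(t) for t = 1..N
    Returns the sum of squared second differences (lower values = smoother sequence).
    """
    second_diff = np.diff(np.asarray(area_curve, dtype=float), n=2)  # a_C''(t), unit frame spacing
    return float(np.sum(second_diff ** 2))
```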

3 Experiments

Two main experiments were conducted. First, the pseudo-labels were generated and evaluated against reference segmentations. Second, the pseudo-labels were utilized to complement the original dataset and train the segmentation network.

All the models were implemented in PyTorch 1.12.1 and trained using 2 Intel Xeon Gold 6128 CPUs (6 cores, 3.40GHz) and a GeForce RTX 2080 Ti.

3.1 Data and Preprocessing

This study uses two public datasets: CAMUS [8] and TED [12]. CAMUS contains 2D echocardiograms in two-chamber (2CH) and four-chamber (4CH) views, covering half-cycle sequences (from ED to ES) of 500 patients (450 training, 50 test). Manual annotations of the LV cavity, LV myocardium and left atrium (LA) are provided for the ED and ES frames only. TED is a subset of CAMUS that comprises 98 full-cycle 4CH sequences, with manual segmentations of the LV cavity and the LV myocardium for the whole cardiac cycle. Of these, 94 sequences belong to the CAMUS training set and 4 to the test set.

Prior to analysis, all images are resized to 512 \(\times \) 512 px, and the pixel spacing is scaled proportionally to preserve the anisotropic nature of the data.
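
One possible implementation of this preprocessing step, assuming scikit-image for the resampling, is sketched below; the function name and interface are ours.

```python
import numpy as np
from skimage.transform import resize

def resize_with_spacing(image, spacing, target_shape=(512, 512)):
    """Resize a 2D frame to a fixed grid and rescale the pixel spacing accordingly.

    image   : 2D array (H, W)
    spacing : (row_spacing, col_spacing), e.g. in mm/px
    Returns the resized frame and the updated spacing, preserving the physical extent
    (and thus the anisotropy) of the original image.
    """
    resized = resize(image, target_shape, order=1, preserve_range=True, anti_aliasing=True)
    new_spacing = (spacing[0] * image.shape[0] / target_shape[0],
                   spacing[1] * image.shape[1] / target_shape[1])
    return resized.astype(image.dtype), new_spacing
```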

3.2 Pseudo-label Generation

The DIR model was trained on the CAMUS training set after leaving out the 94 echocardiograms that overlap with TED, resulting in a set of 806 echo sequences. Subsequently, the frame-wise alignment quality was evaluated on these 94 left-out TED sequences. The DIR network was trained on every intra-patient pair of frames from the registration training set. Training was performed for 10,000 iterations with a batch size of 32, the AMSGrad variant of the Adam optimizer and a learning rate of \(10^{-3}\). Hyperparameters such as the kernel size, the number of kernels and the B-spline grid spacing were determined in preliminary experiments by testing values between 2 and 128. Optimal results were obtained with 32 kernels of size 32 \(\times \) 32, a grid spacing of 32 and a regularization weight \(\alpha \) of 1.0 to prevent folding. Coarse-to-fine registration did not improve performance, hence a single-stage alignment was employed.

Figure 2 demonstrates the performance of pseudo-label generation using different approaches. Pseudo-labels were compared with predictions from a SoTA 2D nnU-Net trained on the original sparsely labeled CAMUS dataset (2D Sparse nnU-Net). Figure 3 highlights the effectiveness of our label propagation method in generating temporally consistent pseudo-labeled segmentation maps, promoting coherent feature learning during the segmentation step.

3.3 Segmentation

The 3D Dense nnU-Net was trained and tested on the sparsely labeled CAMUS dataset augmented with pseudo-labels, allowing direct comparison with related works. In addition, the 3D Dense model was evaluated against two baselines: a 2D nnU-Net trained on the sparsely labeled CAMUS dataset (2D Sparse nnU-Net) and a 2D nnU-Net trained on the augmented CAMUS dataset (2D Dense nnU-Net). Each nnU-Net was trained for 1,000 epochs, using 5-fold cross-validation with an interleaved test setup. After training, the framework automatically selected the best U-Net configuration. Finally, three SoTA CLAS-based methods [4, 18, 19] were included for comparison. The models were compared in terms of (i) accuracy of the LV cavity, LV myocardium and LA segmentation at ED and ES; (ii) estimation of EF, EDV and ESV; and (iii) temporal smoothness.

The average segmentation performance on the ED and ES frames of the test set is listed in Table 1; the results of the EDV, ESV and EF estimation are displayed in Table 2; the observed temporal consistency of frame-by-frame predictions is shown in Fig. 4; finally, the area curve of a test patient is depicted in Fig. 5 along with the corresponding ED and ES predictions.

Fig. 2. Comparison of the pseudo-label quality in terms of geometric metrics evaluated on the densely annotated TED dataset.

Table 1. Average segmentation results at ED and ES on the (sparsely annotated) CAMUS test set. The intra-observer variability results (in blue) are taken from the official CAMUS website and are not provided for the left atrium. The best value per column is indicated in bold.
Fig. 3. Left atrium area over time from the pseudo-labels of patient0010 (4CH).

Table 2. LV volume and EF estimation on the CAMUS test set. The intra-observer variability is indicated in blue, and the best column-wise value is displayed in bold.
Fig. 4. Temporal smoothness of the CAMUS test set predictions in terms of the metric from Eq. 2 (lower values indicate higher smoothness). Note the logarithmic y-axis.

Fig. 5. Evaluation of the temporal consistency on patient0002 from the test set. Top row: area curves. Bottom row: predictions at ED and ES. The green contours refer to the ground truth and the magenta outlines to the predictions of the 3D Dense model.

4 Discussion and Conclusion

This paper presented a method for temporally consistent segmentation of echocardiography using sparsely labeled data. The method exploits pseudo-labels generated by the use of DIR to complement the original set of sparsely annotated frames and allow the training of a 3D nnU-Net.

The analysis of the generated pseudo-labels revealed the benefits of bidirectional over unidirectional label propagation. Results on the subsequent ED and ES segmentation task demonstrate that exploiting the pseudo-labels retains or improves the performance of the model trained on the sparsely labeled dataset, thereby endorsing their quality for downstream applications. The geometric metrics show that all three evaluated models perform at least as well as the SoTA methods, achieving a level of accuracy on par with intra-observer variability. However, evaluation of the temporal smoothness showed that the 2D Dense model outperforms the 2D Sparse model and that the 3D Dense model, in turn, outperforms both. For quantification of the LV volumes, the 3D Dense model outperforms all SoTA methods, with EDV and ESV values closely matching intra-observer variability. EF estimation is less remarkable; however, we argue that the method's very low bias and an MAE on par with intra-rater variability indicate sufficiently accurate estimation of this measure.

A more notable limitation of our approach is its exclusive focus on the systolic function. Longer sequences can be analyzed by identifying and extracting the systolic phase from the entire heart cycle [4], but this would still preclude the characterization of the diastolic function, which is relevant to various heart diseases [16]. To this end, related studies have investigated the extraction of more meaningful temporal features [21] and the application of cyclical self-supervision [5]. As a direct extension of this work, future research could explore the efficacy of registering unlabeled frames to the same image (specifically, the ED or ES ground truth) as an alternative to the sequential approach. This could limit error accumulation and potentially extend our method to encompass full- or multi-cycle sequences. However, this may be detrimental to the temporal consistency of the pseudo-labels and thus to the downstream segmentation and quantification.

Figure 5 shows that the 3D Dense model yields quantitative indices that are slightly offset from the ground truth and from the 2D models, especially at ED and ES. Examination of other patients indicates that the model does not systematically favor over- or under-segmentation. Rather, Fig. 5 suggests the presence of uncertain boundaries in the data. Disagreements between manual and automatic segmentations arise when the endocardium is occluded, or when the LV myocardium and/or the LA extend beyond the field of view. In these cases, the ambiguous position of the structures likely influences the creation of the manual annotations. Accordingly, the ambiguity is reflected in the predictions of the models, resulting in the observed discrepancy. Future work could model this randomness in order to convey the reliability of a given estimation. Extensions of this study may also attempt to limit the aforementioned uncertainty, for instance by selectively choosing high-quality pseudo-labels for training, or by leveraging distinct loss functions (or weighting schemes) for ground truth and pseudo-labeled frames [17, 20].

In conclusion, our approach achieves accurate segmentation comparable to SoTA methods while offering remarkable temporal consistency. Unlike end-to-end frameworks such as CLAS [4, 18, 19], our approach separates pseudo-label generation and segmentation, offering flexibility and modularity.