
1 Introduction

Echocardiography is widely and routinely used for assessing heart function and for the diagnosis of several conditions, such as heart failure and coronary artery disease [13]. In a routine echocardiographic study, multiple views of the heart are obtained to show different parts of the heart’s internal structure, i.e. the ventricles, atria and valves—see Fig. 1. However, depending on the cardiac function being assessed or the type of disease being investigated, not all views are used in subsequent analysis of the echocardiograms [13]. Therefore, an important initial step in any automated analysis pipeline is the accurate detection of the standardised cardiac view shown in each echocardiogram. Frequently, further analysis—usually performed with proprietary analysis software—focuses on left ventricular function [17], and often only the three apical views of the heart, which show slices through the left ventricle, are assessed. However, it is still important for a view classifier to be aware of the entire cardiac anatomy so that it does not misclassify views it has not been trained on. This is challenging because it requires large training datasets with appropriate labels. Furthermore, when assessing certain cardiac conditions, a microbubble contrast agent is injected to better highlight the boundaries of the left ventricular wall [20]. This changes the image appearance completely and effectively inverts the image contrast. Hence, these views cannot be classified unless contrast-enhanced data are also labelled for model training; correctly classifying contrast images thus requires double the labelling effort.

Fig. 1.

Examples of the different echocardiography views used, including the apical 2/3/4/5-chamber, parasternal long-axis (PLAX), short-axis (SAX) at the papillary muscle level, right ventricular (RV) and suprasternal notch (SSN) views. The top row shows images obtained after injection of a microbubble contrast agent, causing a near inversion of image contrast, whereas the lower two rows show non-contrast images.

View classification on echocardiographic data has previously been achieved using convolutional networks [8, 18, 22] that take an image as input and predict one of the views present in the network’s training label set. For commonly acquired echocardiographic views, such as the apical four-chamber view, labelled data for model training is available, including in some public datasets [14, 19]. However, for less commonly acquired views, with or without contrast enhancement, labels are time-consuming and expensive to acquire, and datasets are therefore often highly imbalanced. To tackle this imbalance, training classifiers may require under-sampling the majority classes, specialised cost functions [10], or augmentation with synthetically generated data [1].

In this paper, we investigate the problem of view classification in cardiac ultrasound images and attempt to improve the classification accuracy of convolutional neural networks, especially on under-represented classes, through contrastive learning. Contrastive learning is a pre-training methodology that improves the learning of features useful for classification through a contrastive loss, which clusters similar images together (positive pairs) and pushes dissimilar images apart (negative pairs). The pairs can be defined entirely by self-supervision, for example when positive pairs consist of different augmented versions of the same image (SimCLR [6]), or with additional supervision so that positive pairs also include images with the same label (SupCon [11]). This approach has proven successful in computer vision tasks, for instance ImageNet classification [6].

Furthermore, although cardiac ultrasound data consist of videos, view classification is typically performed per frame as a 2D classification problem. For videos, unsupervised contrastive learning such as SimCLR is not directly applicable, as also discussed in [7]: if multiple frames of the same video end up in the same batch, the negative pairs of a frame will include other frames of that video. This hinders the ability of the contrastive loss to cluster only similar images together, since frames of the same video would generate a higher loss value. We therefore adopt the supervised contrastive loss [11], which does not suffer from this limitation. Our contributions are the following: (a) we apply contrastive classification neural networks to cardiac ultrasound, and (b) we evaluate on a dataset of contrast-enhanced and non-contrast echocardiographic images collated from public and proprietary sources, showing improved results with the proposed contrastive framework for views that have fewer labelled training observations.

2 Related Work

Standard plane/view detection has previously been studied in fetal ultrasound with supervised deep learning models, such as SonoNet [2], multi-scale DenseNet [12], and convolutional networks fine-tuned with transfer learning [5] or trained with additional tasks to predict attention maps and with adversarial training [3]. In echocardiography, Inception [18] and VGG [22] networks have been used to predict several views or subclasses of views, although they were not applied to contrast echo data. Typically, contrast-enhanced images are used in isolation, for example to extract myocardial segmentations [15, 16]. Most recently, high view classification accuracy was reported for a convolutional network applied to mixed microbubble contrast-enhanced and non-contrast data from a multi-vendor site [8].

Given sufficiently large datasets, supervised training of convolutional networks achieves accurate view detection. However, network initialisation is important to facilitate convergence, and pre-training methods using self-supervision with different augmented views of the same image [6] or using labels [11] have therefore been investigated to improve computer vision classification tasks, such as on the ImageNet dataset. Contrastive learning has also been used in the medical domain, for instance to improve segmentation performance on MRI images [4] or to learn joint representations of ultrasound videos and speech [9].

Fig. 2.

Schematic of the baseline and contrastive models. (a) The baseline model architecture consists of a fully convolutional encoder and a fully connected classifier, and is trained with full supervision. (b) The contrastive model pre-trains the encoder using a projection network and a contrastive loss. The contrastive loss treats two images as a positive pair if they are different augmentations of the same image or belong to the same class, and as a negative pair otherwise.

3 Methodology

Given an image x of view \(y_k\), where \(k \in [1,13]\) indexes 13 classes of commonly acquired views with or without contrast, we consider a 2D baseline classification neural network c(x) that detects per-frame view labels. This network maps input images to a vector representation through five convolutional blocks, each containing two convolutional layers (each followed by batch normalisation and a ReLU activation) and a max pooling layer; the vector is then processed by two fully connected layers to generate a view label prediction. This architecture, used in an eight-class form in [8], is designed to be sufficiently small while remaining effective for standard view classification, and is shown in Fig. 2a. Training is performed with the categorical cross-entropy loss:

$$ L_{view} = -\sum_{k=1}^{13} y_k \log\big(c_k(x)\big), $$

where \(c_k(x)\) denotes the k-th softmax output of the classifier.
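For illustration, the following is a minimal Keras sketch consistent with the description above; the filter counts per block and the size of the hidden fully connected layer are assumptions made for the example and are not specified in the paper.

```python
# Sketch of the baseline classifier c(x): five blocks of two conv layers
# (each with batch norm and ReLU) plus max pooling, then two fully
# connected layers. Filter counts and the hidden size are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 13

def conv_block(x, filters):
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return layers.MaxPooling2D()(x)

def build_baseline(input_shape=(192, 192, 1)):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (16, 32, 64, 128, 256):   # five convolutional blocks
        x = conv_block(x, filters)
    x = layers.Flatten()(x)                   # vector representation
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs, name="baseline_view_classifier")

model = build_baseline()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy")  # L_view
```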

A contrastive learning framework is then implemented following the SupCon [11] methodology: we split the baseline model into a fully convolutional sub-model and a fully connected sub-model, which are used as an encoder f(.) and a classification network h(.), respectively, so that \(c = h \circ f\). We add a projection network g(.), which projects the encoded features \(z=f(x)\) into a representation \(\hat{x}=g(z)\). The projection \(\hat{x}\) is used as input to the contrastive loss that pre-trains the encoder. Finally, the classification network h(.) learns a mapping from the encoded features to their corresponding labels; it is trained in a second stage following the encoder pre-training, whilst keeping the encoder weights fixed. A schematic of the framework is shown in Fig. 2.
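A minimal sketch of this decomposition, reusing the convolutional block from the baseline sketch above, is shown below; the projection and hidden dimensionalities are assumptions for illustration.

```python
# Sketch of the decomposition c = h ∘ f with an added projection head g
# used only during pre-training. Layer sizes are assumptions.
from tensorflow.keras import layers, models

def build_encoder(input_shape=(192, 192, 1)):        # f(.)
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (16, 32, 64, 128, 256):
        x = conv_block(x, filters)                    # block from the baseline sketch
    z = layers.Flatten()(x)
    return models.Model(inputs, z, name="encoder_f")

def build_projector(z_dim):                           # g(.), discarded after pre-training
    return models.Sequential(
        [layers.Dense(128, activation="relu", input_shape=(z_dim,)),
         layers.Dense(128)],
        name="projector_g")

def build_classifier(z_dim, num_classes=13):          # h(.)
    return models.Sequential(
        [layers.Dense(128, activation="relu", input_shape=(z_dim,)),
         layers.Dense(num_classes, activation="softmax")],
        name="classifier_h")

# Stage 1: train f and g with the supervised contrastive loss (Sect. 3).
# Stage 2: freeze f (encoder.trainable = False) and train h with L_view.
```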

The contrastive learning process is more formally described as follows: given N randomly augmented images \(\{x_i\}_{i=1}^N\), we first obtain a batch of 2N images \(B=\{1\ldots 2N\}\) by applying a second augmentation. For every image \(x_i\) in the batch, with projection \(\hat{x}_i=g(f(x_i))\), there are \(M_i\) other images of the same label, forming the set \(P_i=\{x_j\}_{j=1}^{M_i}\). Following [11], the supervised contrastive loss is defined as:

$$ L_{supcon} = \sum_{i\in B} -\log\left\{ \frac{1}{M_i}\sum_{j \in P_i} \frac{\exp(\hat{x}_i \cdot \hat{x}_j / \tau)}{\sum_{\alpha \in B \setminus i} \exp(\hat{x}_i \cdot \hat{x}_{\alpha} / \tau)} \right\}, $$

where \(\tau \) controls the temperature scaling of the softmax. We set \(\tau =1000\) as per [11] and use brightness and contrast augmentations, as well as rotations of up to 30\(^\circ \) and spatial translations of up to 10% of the image dimensions.
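A TensorFlow sketch of this loss, following the formula above, is given below; the L2-normalisation of the projections and the small numerical-stability constants are assumptions that the text does not state.

```python
# Sketch of the supervised contrastive loss as written above (positives
# averaged inside the log). L2-normalisation and epsilon terms are
# assumptions for numerical stability.
import tensorflow as tf

def supcon_loss(projections, labels, temperature=1000.0):
    """projections: (2N, d) outputs of g(f(x)); labels: (2N,) integer view labels."""
    z = tf.math.l2_normalize(projections, axis=1)
    sim = tf.matmul(z, z, transpose_b=True) / temperature      # x_i · x_j / tau
    batch = tf.shape(z)[0]
    not_self = 1.0 - tf.eye(batch)                              # exclude alpha = i
    exp_sim = tf.exp(sim) * not_self
    denom = tf.reduce_sum(exp_sim, axis=1)                      # sum over B \ {i}
    labels = tf.reshape(labels, (-1, 1))
    pos_mask = tf.cast(tf.equal(labels, tf.transpose(labels)), tf.float32) * not_self
    m_i = tf.reduce_sum(pos_mask, axis=1)                       # M_i
    pos_term = tf.reduce_sum(exp_sim * pos_mask, axis=1) / tf.maximum(m_i, 1.0)
    loss_i = -tf.math.log(pos_term / denom + 1e-12)
    return tf.reduce_sum(loss_i)
```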

Table 1. Description of the training and test dataset.

3.1 Data

The dataset used in this work comprises anonymised 2D echocardiograms from multiple sites: data from EVAREST [21], a multi-site, multi-vendor UK trial; part of the public EchoNet dataset [19]; and proprietary data from other imaging sites. The final dataset is split into a training set and a test set corresponding to 1,538 and 359 subjects, respectively, containing 327,019 and 57,648 image frames. Each echocardiographic video was labelled as one of 13 classes, covering a set of standard cardiac views with or without microbubble contrast. The classes are shown in the first and second columns of Table 1, along with the number of subjects, echocardiograms and images for each view.

Images were extracted from DICOM or AVI files and were pre-processed to remove all text information and annotations outside the ultrasound sector, so that the dataset contains only the images within the ultrasound sector.

As part of the EVAREST trial data, the dataset contains echocardiograms obtained with patients at rest and with patients subjected to exercise or pharmacological stress. Heart rates vary from 45 to 150 beats per minute and the number of heartbeats per scan is between one and three. The inclusion of stress echo data ensures that a range of image qualities is present in the dataset, as stress echocardiograms tend to be of poorer image quality.

4 Experiments and Discussion

4.1 Experimental Setup

Prior to being fed into the network, image frames are resized to \(192\times 192\) pixels, z-score normalised, and rescaled to the [0, 1] range. The model and pipeline were developed in Python 3.7.7 with TensorFlow 2.2, and training was performed on four Nvidia GeForce RTX 2080 Ti GPUs with 11 GB of VRAM each.
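A minimal sketch of this per-frame preprocessing, assuming single-channel float input, could look as follows:

```python
# Sketch of the preprocessing described above: resize to 192x192,
# z-score normalise, then rescale to [0, 1]. Epsilon terms are
# assumptions for numerical stability.
import tensorflow as tf

def preprocess_frame(frame):
    """frame: (H, W, 1) float32 tensor containing a single echo frame."""
    frame = tf.image.resize(frame, (192, 192))
    mean, var = tf.nn.moments(frame, axes=[0, 1, 2])
    frame = (frame - mean) / tf.sqrt(var + 1e-8)      # z-score normalisation
    f_min = tf.reduce_min(frame)
    f_max = tf.reduce_max(frame)
    return (frame - f_min) / (f_max - f_min + 1e-8)   # rescale to [0, 1]
```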

The baseline and contrastive learning methods were trained using Adam with batch size 64 and a learning rate of 0.0001, with 8-fold cross-validation in which the validation set contained 10% of the training dataset’s echocardiograms. Training was stopped using an early-stopping criterion based on the validation set.
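The training settings stated above could be expressed as in the sketch below, where `train_ds` and `val_ds` are assumed tf.data.Dataset objects of (image, one-hot label) pairs, and the patience value and monitored quantity are assumptions.

```python
# Sketch of the stated training configuration: batch size 64 and early
# stopping on the validation split. The model is assumed to be compiled
# with Adam (learning rate 1e-4) as in the baseline sketch above.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=10,
                                              restore_best_weights=True)

model.fit(train_ds.batch(64),
          validation_data=val_ds.batch(64),
          epochs=200,
          callbacks=[early_stop])
```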

We train models using all 13 view classes in two scenarios: one using all data, and one using a reduced dataset of around 50 echocardiograms per class, chosen at random. We report the mean F1 score, precision and recall across the different validation splits and on a held-out test set that is common across the splits.

Table 2. Classification results (mean and standard deviation) of baseline and contrastive models on validation (taken from 10% of the training set) and test sets using two datasets containing all data and 50 echocardiograms per class, respectively.

4.2 Classification Performance

Table 2 shows the mean and standard deviation of the F1 score, precision and recall for the experiments on the full and reduced datasets. Both methods perform equally well on the dataset of 50 echocardiograms per class, which is balanced. On the full dataset, we observe an improvement in test F1 score from 0.874 to 0.892, together with smaller standard deviations in precision and recall.

Table 3 reports the per-class test F1 score for the two datasets. The contrastive training has minimal effect for the model trained on 50 echocardiograms per class. When training on the full dataset, classes with a larger amount of training data show similar or marginally improved performance on the test set. However, classes with substantially less training data, such as the contrast PLAX view, the non-contrast 5-chamber view and the non-contrast right ventricular (RV) view, show greater improvement when using contrastive learning. The non-contrast suprasternal notch (SSN) view shows a 4% reduction, but both baseline and contrastive model accuracies remain very high.

Table 3. Classification results (mean and standard deviation) per class. The first column indicates whether the images have contrast or not. Results show the F1 score on the test set for two experiments using different training set sizes, with the number of studies for each view shown. The largest differences are marked in bold.

4.3 Ablation Studies and Failure Cases

We perform two ablation experiments on the model parameters. Firstly, we evaluate the effect of batch size by testing values of 32 and 16; the results are the same as those achieved with batch size 64. Although it has been reported that large batch sizes benefit contrastive learning [6], since a larger batch contains more positive and negative examples, the effect is minimal at this range of values. GPU memory limitations prevented experiments with larger batch sizes.

We also experiment with different sets of augmentations. The experiments in Sect. 4.2 use random rotations and translations, as well as changes in brightness and contrast. Random crops resulting in images of \(140\times 140\) pixels were also tested; however, training with such crop augmentations decreased the validation F1 score of the contrastive model by approximately 15%. This can be attributed to the fact that cropped ultrasound images may resemble other views.

Finally, Fig. 3 shows a selection of cases for which the baseline model fails but, in some instances, the contrastive model predicts correctly. In all cases the incorrectly predicted view is visually similar to the true view (for example, the apical 4- and 5-chamber views are very similar), so it is evident why the models struggle. The contrastive model is likely more successful on these challenging views because it creates a better decision boundary between classes.

Fig. 3.

Selection of failure cases. The baseline model fails on all these, but SupCon correctly classifies the examples in the top row.

5 Conclusion

We have shown that contrastive learning applied to echocardiographic view classification can improve the accuracy, and reduce the standard deviation, of the classifier for views with far less training data, with no reduction in overall performance. This indicates that contrastive learning could be a powerful tool for developing medical image analysis models without requiring such intensive collection and labelling of very large datasets.

We leave as future work testing the effect of different contrastive losses on diverse datasets potentially including unlabelled data, as well as studying the effect of design biases introduced by different encoder architectures on the quality of the learnt latent representations.