
1 Introduction

Semi-supervised and self-supervised representation learning, which require few or no annotations, have attracted significant attention across various medical imaging modalities [7, 10, 26,27,28]. These learning schemes can exploit large-scale unlabeled medical datasets and learn meaningful representations for downstream task fine-tuning. In particular, contrastive representation learning based on instance discrimination tasks [6, 11] has become the leading paradigm for self-supervised pre-training, where a model is trained to pull each instance and its augmented views together in the embedding space while pushing it away from all other instances.

Fig. 1. Illustration of different representation learning approaches for fetal ultrasound: (a) self-supervised contrastive learning, (b) contrastive learning with patient metadata, and (c) our proposed anatomy-aware contrastive learning. Circle, square and triangle icons denote the anatomical categories of fetal head, profile, and abdomen, respectively. The anchor image is highlighted with a red bounding box, and the red dotted circle indicates "pull together" (best viewed in color).

However, directly applying self-supervised contrastive learning (e.g. SimCLR [6] and MoCo [11]) in the context of medical imaging may result in visual representations that are inconsistent in appearance and semantics. We illustrate this issue in Fig. 1(a), which shows that a vanilla contrastive learning approach without considering the domain-specific anatomical characteristics leads to false negatives, i.e. some negative samples having high affinity with the anchor image are “pushed away”. To address this, we explore the following question: Is domain-specific anatomy information helpful in learning better representations for medical data?

We investigate this question via the proposed anatomy-aware contrastive learning (AWCL), as depicted in Fig. 1(c), where "anatomy-aware" means that anatomy information is leveraged to augment positive/negative pair sampling during contrastive learning. In this work, we demonstrate the proposed approach on fetal ultrasound imaging tasks, where a number of different fetal anatomies can be present in a diagnostic scan. Motivated by Khosla et al. [18], we expand the pool of positive samples by grouping images from the same or different ultrasound scans that share common anatomical categories. More importantly, our approach is optimized alternately with both the conventional and the anatomy-aware contrastive objectives, as shown in Fig. 2(a), given that anatomy information is not always accessible for each sampling process. Moreover, we consider both coarse- and fine-grained anatomical categories for data sampling, depending on their availability, as shown in Fig. 2(b) and (c), and empirically investigate their effect on the transferability of the learned feature representations. To assess the effectiveness of our pre-trained representations, we evaluate transfer learning on three downstream clinical tasks: standard plane detection, segmentation of Crown Rump Length (CRL) and Nuchal Translucency (NT), and recognition of first-trimester anatomical structures. In summary, the main contributions and findings are:

  • We develop an anatomy-aware contrastive learning approach for medical fetal ultrasound imaging tasks.

  • We empirically compare the effect of inclusion of anatomy information with coarse- and fine-grained granularity respectively, within our contrastive learning approach. The comparative analysis suggests that contrastive learning with fine-grained anatomy information which preserves intra-class difference is more effective than its counterpart.

  • Experimental evaluations on three downstream clinical tasks demonstrate the better generalizability of our proposed approaches over learning from an ImageNet pre-trained ResNet, vanilla contrastive learning [6], and contrastive learning with patient information [2, 7, 25].

  • We provide an in-depth analysis to show the proposed approach learns high-quality discriminative representations.

2 Related Work

Self-supervised Learning (SSL) in Medical Imaging. Prior works using SSL for medical imaging typically focus on designing pretext tasks, such as solving a Rubik's cube [28], image restoration [14, 27], predicting anatomical position [3] and multi-task joint reasoning [17]. Recently, contrastive SSL [6, 11] has been successfully applied to learn more discriminative representations across various medical imaging tasks [7, 24, 26]. In particular, Sowrirajan et al. [24] adapted a MoCo-style contrastive learning method [11] to chest X-rays and demonstrated better transferable representations and initialization for chest X-ray diagnostic tasks. Taher et al. [13] presented a benchmark evaluation study investigating the effectiveness of several established contrastive learning models pre-trained on ImageNet for a variety of medical imaging tasks. In addition, recent approaches [2, 7, 25] leverage patient metadata to improve contrastive learning for medical imaging. These approaches constrain the selection of positive pairs to the same subject (video), with the assumption that visual representations from the same subject share similar semantic meaning. However, they may not generalize well to a scenario where different organs or anatomical structures are captured in a single video. For instance, as seen in Fig. 1(b), some positive pairs having low affinity in visual appearance and semantics are pulled together, i.e. false positives, which can degrade the representation learning. The proposed learning scheme, as shown in Fig. 1(c), addresses the aforementioned limitations by augmenting the sampling process with anatomy information. Moreover, our approach differs from [26] and [7], which combine label information with self-supervision as an additional supervision signal for multi-tasking.

Representation Learning in Fetal Ultrasound. There are related works exploring representation learning for fetal ultrasound imaging tasks. Baumgartner et al. [4] and Schlemper et al. [22] proposed a VGG-based network and an attention-gated network respectively to detect fetal standard planes. Sharma et al. [23] presented a multi-stream network which combines 2D image and spatio-temporal information to automate clinical workflow description of full-length routine fetal anomaly ultrasound scans. Cai et al. [5] considered incorporating the temporal dimension into visual attention modelling via multi-task learning for standard biometry plane-finding navigation. However, the generalization and transferability of those models to other target tasks remains unclear. Droste et al. [8] proposed to learn transferable representations for fetal ultrasound interpretation by modelling sonographer visual attention (gaze tracking) without manual supervision. More recently, Jiao et al. [16] proposed to derive a meaningful representation from raw data by developing a cross-modal contrastive learning which aligns the correspondence between fetal ultrasound video and narrative speech audio. Our work differs by focusing on learning general image representations without requiring additional data modalities (e.g. gaze tracking and audio) from the domain of interest, and we also perform extensive experimental analysis on three downstream clinical tasks to assess the effectiveness of the learned representations.

3 Fetal Ultrasound Imaging Dataset

This study uses a large-scale fetal ultrasound imaging dataset, which was acquired as part of PULSE (Perception Ultrasound by Learning Sonographic Experience) project [9]. The scans were performed by operators including sonographers and fetal medicine specialists using a commercial Voluson E8 version BT18 (General Electric, Zipf, Austria) ultrasound machine. During a routine scanning session, the operator views several fetal or maternal anatomical structures. The frozen views saved by sonographers are referred to as standard planes in the paper, following the UK Fetal Anomaly Screening Programme (FASP) nomenclature [1].

Fetal ultrasound videos were recorded from the ultrasound scanner display using lossless compression and sampled at a rate of 30 Hz. We consider a subset of the entire ultrasound dataset for the proposed pre-training approach, consisting of 2,050,432 frames in total from 534 second-trimester ultrasound videos. In this sub-dataset, 15,384 frames are labeled with 13 fine-grained anatomy categories, including four views of the heart, namely three-vessel and trachea (3VT), four-chamber (4CH), right ventricular outflow tract (RVOT) and left ventricular outflow tract (LVOT); two views of the brain, transventricular (BrainTv.) and transcerebellum (BrainTc.); two views of the spine, coronal (SpineCor.) and sagittal (SpineSag.); abdomen; femur; kidneys; lips; profile; and a background class. In addition, 69,671 frames are labeled with coarse anatomy categories, which do not divide the heart, brain and spine into the sub-categories above, and additionally cover 3D mode, maternal anatomy including Doppler, abdomen, nose and lips, kidneys, face-side profile, full-body-side profile, bladder including Doppler, femur and an "Other" class. All image frames were preprocessed by cropping the ultrasound image region and resizing it to \(224\times 288\) pixels.

Fig. 2. (a) Overview of the proposed anatomy-aware contrastive learning approach. (b) and (c) illustrate the use of coarse- and fine-grained anatomy categories, respectively, within the proposed AWCL framework. White-circle, grey-circle, square and triangle icons denote the classes of coronal view of spine, sagittal view of spine, profile, and abdomen, respectively.

4 Method

In this section, we first describe the problem formulation of contrastive learning with medical images, and then present our anatomy-aware contrastive learning algorithm design as well as training details.

4.1 Problem Formulation

For each input image \(\textbf{x}\) in a mini-batch of N samples, randomly sampled from a pre-training dataset \(\mathcal {V}\), a contrastive learning framework (e.g. SimCLR [6]) applies two augmentations to obtain a positive pair \((\tilde{\textbf{x}}_{i}, \tilde{\textbf{x}}_{j})\), yielding a set of 2N samples. Letting i denote the index of the anchor, the contrastive learning objective is defined as

$$\begin{aligned} L_{C}^{i}=-\log \frac{\exp \left( {\text {sim}}\left( \textbf{z}_{i}, \textbf{z}_{j}\right) / \tau \right) }{\sum _{k=1}^{2N} \textbf{1}_{[k \ne i]} \exp \left( {\text {sim}}\left( \textbf{z}_{i}, \textbf{z}_{k}\right) / \tau \right) }, \end{aligned}$$
(1)

where \(\textbf{1}_{[k \ne i]}\in \{0,1\}\) is an indicator function equal to 1 iff \(k \ne i\), \(\tau \) is a temperature parameter and \({\text {sim}}(\cdot )\) is the pairwise cosine similarity. \(\textbf{z}\) is a representation vector, computed as \(\textbf{z}= g(f(\textbf{x}))\), where \(f(\cdot )\) denotes a shared encoder modelled by a convolutional neural network (CNN) and \(g(\cdot )\) is a multi-layer perceptron (MLP) projection head.
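For concreteness, the following is a minimal PyTorch-style sketch of the objective in Eq. 1. The function name `nt_xent_loss` and the convention that the two views of the same image sit at rows i and i+N of the batch are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Vanilla contrastive loss of Eq. 1.

    z: (2N, d) projected embeddings; rows i and i+N are assumed to hold the
    two augmented views of the same image (an illustrative convention).
    """
    z = F.normalize(z, dim=1)                        # cosine similarity via dot products
    sim = torch.matmul(z, z.t()) / tau               # (2N, 2N) similarity matrix
    n2 = z.size(0)
    self_mask = torch.eye(n2, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))  # the 1_{[k != i]} indicator
    log_prob = F.log_softmax(sim, dim=1)             # softmax over all k != i
    idx = torch.arange(n2, device=z.device)
    pos_idx = torch.cat([idx[n2 // 2:], idx[:n2 // 2]])  # i <-> i+N pairing
    return -log_prob[idx, pos_idx].mean()
```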

The above underpins vanilla contrastive learning. However, in some cases (e.g. the ultrasound scans considered in this paper), this standard approach, as well as its extension that leverages patient information [2, 7, 25], may lead to false negatives and false positives respectively, as seen in Fig. 1(a) and (b). To address this, we introduce a new approach, detailed next.

4.2 Anatomy-Aware Contrastive Learning

Figure 1(c) illustrates the main idea of the new anatomy-aware contrastive learning (AWCL) approach, which incorporates additional samples belonging to the same anatomy category from the same or different US scans. In addition to positive sampling from the same image and its augmentation, AWCL is tailored to the case where multiple anatomical structures are present.

As shown in Fig. 2(a), we utilize the available anatomy information as detailed in Sect. 3, forming a positive sample set \(\mathcal {A}(i)\) with the same anatomy as sample i. The assumption for such a design is that image samples within the same anatomy category should have similar appearances, based on a clinical perspective [9]. Motivated by [18], we design the anatomy-aware contrastive learning objective as follows,

$$\begin{aligned} L_{A}^{i}=-\frac{1}{|\mathcal {A}(i)|}\sum _{a\in \mathcal {A}(i)}\log \frac{\exp \left( {\text {sim}}\left( \textbf{z}_{i}, \textbf{z}_{a}\right) / \tau \right) }{\sum _{k=1}^{2N} \textbf{1}_{[k \ne i]} \exp \left( {\text {sim}}\left( \textbf{z}_{i}, \textbf{z}_{k}\right) / \tau \right) }, \end{aligned}$$
(2)

where \(|\mathcal {A}(i)|\) denotes the cardinality of the positive set \(\mathcal {A}(i)\).

Due to the limited availability of some anatomical categories, \(\mathcal {A}(i)\) is not always available for each sampling process. The AWCL framework is therefore formulated as an alternating optimization that combines the learning objectives of Eq. 1 and Eq. 2, giving the loss function

$$\begin{aligned} L^{i}= {\left\{ \begin{array}{ll} L_{C}^{i}&{} \text{ if } |\mathcal {A}(i)| = 0 \\ L_{A}^{i}&{} \text{ if } |\mathcal {A}(i)| > 0. \end{array}\right. } \end{aligned}$$
(3)

Furthermore, we consider both coarse- and fine-grained anatomical categories for the proposed AWCL framework and compare their effect on the transferability of visual representations. Figure 2(b) and (c) illustrate the motivation for this comparative analysis. For an anatomical structure with several views of distinct visual appearance (e.g. the spine, which has two views as sub-classes), AWCL with coarse-grained anatomy information tends to minimize the intra-class difference by pulling together all instances of the same anatomy. In contrast, AWCL with fine-grained anatomy information tends to preserve the intra-class difference by pushing away images with different visual appearances despite sharing the same anatomy. Both strategies of the proposed learning approach are evaluated and compared in Sect. 6.3. We further study the impact of the ratio of anatomy information used in AWCL pre-training in Sect. 6.4.

Algorithm 1. Pseudo-code of the proposed AWCL pre-training procedure.
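The pseudo-code itself is not reproduced here; as a rough substitute, below is a minimal PyTorch-style sketch of the AWCL objective (Eq. 2) with the per-anchor fallback to the vanilla objective (Eq. 3). The batch layout, the use of -1 to mark frames without an anatomy label, and the inclusion of the augmented view in the positive set are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def awcl_loss(z: torch.Tensor, anatomy: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Anatomy-aware contrastive loss (Eq. 2) with the fallback of Eq. 3.

    z:       (2N, d) projected embeddings; rows i and i+N are the two views.
    anatomy: (2N,) integer anatomy labels; -1 marks frames without a label
             (an assumed convention).
    """
    z = F.normalize(z, dim=1)
    sim = torch.matmul(z, z.t()) / tau
    n2 = z.size(0)
    self_mask = torch.eye(n2, dtype=torch.bool, device=z.device)
    log_prob = F.log_softmax(sim.masked_fill(self_mask, float('-inf')), dim=1)

    idx = torch.arange(n2, device=z.device)
    aug_pos = torch.cat([idx[n2 // 2:], idx[:n2 // 2]])   # augmented-view partner

    losses = []
    for i in range(n2):
        pos = torch.zeros(n2, dtype=torch.bool, device=z.device)
        if anatomy[i] >= 0:                 # A(i): same anatomy category, any scan
            pos = (anatomy == anatomy[i]) & ~self_mask[i]
        pos[aug_pos[i]] = True              # the augmented view is always a positive
        losses.append(-log_prob[i, pos].mean())   # average over positives (Eq. 2)
    return torch.stack(losses).mean()
```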

4.3 Implementation Details

Algorithm 1 provides the pseudo-code of AWCL. Following prior art [7, 24, 25], we use ResNet-18 [12] as our backbone architecture; further studies on different network architectures are outside the scope of this paper. We split the pre-training dataset detailed in Sect. 3 into training and validation sets (80%/20%), and train the model using the Adam optimizer with a weight decay of \(10^{-6}\) and a mini-batch size of 32. We follow [6] for the data augmentations applied to the sampled training data. The output feature dimension of z is set to 128 and the temperature parameter \(\tau \) is set to 0.5. The models are trained with the loss functions defined earlier (Eq. 1 and Eq. 2, combined as in Eq. 3) for 10 epochs with a learning rate of \(10^{-3}\). The whole framework is implemented in PyTorch [21] on a PC with an NVIDIA Titan V GPU. The code is available at https://github.com/JianboJiao/AWCL.
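A hypothetical sketch of the corresponding pre-training setup is given below; the augmentation pipeline is assumed to follow SimCLR [6], and the MLP head width is an assumption, since only the output dimension (128) is specified in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Assumed SimCLR-style augmentation pipeline for 224x288 ultrasound frames.
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop((224, 288)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

encoder = models.resnet18(weights=None)          # backbone f(.)
encoder.fc = nn.Identity()                       # expose 512-d features
projector = nn.Sequential(                       # MLP head g(.), 128-d output
    nn.Linear(512, 512), nn.ReLU(inplace=True), nn.Linear(512, 128))

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(projector.parameters()),
    lr=1e-3, weight_decay=1e-6)
# mini-batch size 32, temperature 0.5 and 10 training epochs, as in Sect. 4.3
```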

To demonstrate the effectiveness of the AWCL-trained models, we compare them with random initialization, an ImageNet pre-trained ResNet-18 [12], supervised pre-training with coarse labels, supervised pre-training with fine-grained labels, vanilla contrastive learning (SimCLR) [6], and contrastive learning with patient information (CLPI) [2, 7, 19]. All pre-training methods presented here are pre-trained from scratch on the pre-training dataset with similar parameter configurations to those listed above.

5 Experiments on Transfer Learning

In this section, we evaluate the effectiveness of the SSL pre-trained representations by supervised transfer learning with end-to-end fine-tuning on three downstream clinical tasks: second-trimester standard plane detection (Task I), recognition of first-trimester anatomies (Task II) and segmentation of NT and CRL (Task III). The datasets for downstream task evaluation are listed in Table 1 and are independent datasets from [9]. For fair comparison, all compared pre-training models were fine-tuned with the same parameter settings and data augmentation policies within each downstream task evaluation.

5.1 Evaluation on Standard Plane Detection

Evaluation Details. Here, we investigate how the pre-trained representations generalize to an in-domain second-trimester classification task, which consists of the same fine-grained anatomical categories as detailed in Sect. 3. We attach a classifier head [4] to each pre-trained backbone encoder and fine-tune the entire network for 70 epochs with a learning rate of 0.01, decayed by 0.1 at epochs 30 and 55. The network is trained via SGD with momentum of 0.9, weight decay of \(5\times 10^{-4}\), a mini-batch size of 16 and a cross-entropy loss, and is evaluated via three-fold cross-validation. The augmentation policy is analogous to [8], including random horizontal flipping, rotation (10\(^\circ \)), and varying gamma and brightness. We employ precision, recall and F1-score, computed as macro-averages, as the evaluation metrics.
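The macro-averaged metrics can be computed, for instance, with scikit-learn; this is an assumed tooling choice for illustration, not the authors' evaluation code.

```python
from sklearn.metrics import precision_recall_fscore_support

def macro_metrics(y_true, y_pred):
    # Macro-averaged precision, recall and F1 over the standard-plane classes.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0)
    return precision, recall, f1
```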

Table 1. Details of the downstream datasets and imaging tasks.
Table 2. Quantitative comparison of fine-tuning performance (mean ± std. [%]) on the tasks of standard plane detection (Task I), first-trimester anatomy recognition (Task II) and CRL / NT segmentation (Task III). Best results are marked in bold.
Fig. 3. Illustration of the confusion matrix for the first-trimester classification task.

Results and Discussion. Table 2 shows a quantitative comparison of fine-tuning performance for the three evaluated downstream tasks. From the results of Task I, we observe that AWCL pre-trained models, i.e. AWCL (coarse) and AWCL (fine-grained), generally outperform the compared contrastive learning methods SimCLR and CLPI. In particular, AWCL (coarse) improves on SimCLR and CLPI by 1.9% and 3.8% in F1-score, respectively. Compared to the supervised pre-training methods, both AWCL approaches achieve better performance in Recall and F1-score than vanilla supervised pre-training with coarse-grained labels. These findings suggest that incorporating anatomy information to select positive pairs from multiple scans can notably improve representation learning.

However, we find that all the contrastive pre-training approaches presented here underperform supervised pre-training (fine-grained), which has the same form of semantic supervision as Task I. This suggests that without explicitly encoding semantic information, contrastively learned representations may provide limited benefit for generalization to a fine-grained multi-class classification task, which is in line with the findings in [15].

5.2 Evaluation on Recognition of First-Trimester Anatomies

Evaluation Details. We investigate how the pre-trained representations generalize to a cross-domain classification task using the first-trimester US scans. This first-trimester classification task recognises five anatomical categories: crown-rump length (CRL), nuchal translucency (NT), biparietal diameter (BPD), 3D and background (Bk). We split the data into training and testing sets (78%/22%). The trained encoders, followed by two fully-connected layers and a softmax layer, were fine-tuned for 200 epochs with a learning rate of 0.1 decayed by 0.1 at epoch 150. The network was trained using SGD with momentum of 0.9. Standard data augmentation was used, including rotation \([-30^{\circ }, 30^{\circ }]\), horizontal flip, Gaussian noise, and shear \({\le }0.2\). Batch size was adjusted according to model size and GPU memory restrictions. We use the same metrics as in Task I for performance evaluation.
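A minimal sketch of the Task II fine-tuning architecture described above is shown below; the hidden width of the two fully-connected layers is an assumption, as the paper does not specify it, and the softmax is folded into the cross-entropy loss.

```python
import torch.nn as nn

class FirstTrimesterClassifier(nn.Module):
    """Pre-trained encoder followed by two fully-connected layers (Task II)."""

    def __init__(self, encoder: nn.Module, feat_dim: int = 512, num_classes: int = 5):
        super().__init__()
        self.encoder = encoder                     # pre-trained ResNet-18 backbone
        self.head = nn.Sequential(                 # hidden width 256 is an assumption
            nn.Linear(feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes))           # logits; softmax applied in the loss

    def forward(self, x):
        return self.head(self.encoder(x))
```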

Results and Discussion. For Task II, we see from Table 2 that AWCL (fine-grained) achieves the best performance among all the compared solutions. In particular, it achieves a performance gain of 4.9%, 3.4% and 5.0% in Precision, Recall and F1-score compared to ImageNet pre-training, and even improves on supervised pre-training with fine-grained labels (the upper-bound baseline) by 0.7% in F1-score. Moreover, AWCL (coarse) also surpasses ImageNet and supervised pre-training with coarse-grained labels by 1.9% and 6.3% in F1-score. Comparing with the other contrastive learning methods, we observe a similar trend to that described for Task I, i.e. AWCL (coarse) and AWCL (fine-grained) perform better than SimCLR and CLPI. Further evidence is provided in Fig. 3, which shows that both AWCL (coarse) and AWCL (fine-grained) provide better prediction accuracy than CLPI for all anatomy categories. These experimental results again demonstrate the effectiveness of the AWCL approaches and suggest that including anatomy information in contrastive learning is good practice whenever it is available.

5.3 Evaluation on Segmentation of NT and CRL

Evaluation Details. In this section, we evaluate how the pre-trained models generalize to a cross-domain segmentation task with data from the first-trimester US scans. Segmentation of NT and CRL is defined as a three-class segmentation task, the classes being mid-sagittal view, nuchal translucency and background. The data is divided into training and testing sets (80%/20%). We follow the design of a ResNet-18 auto-encoder by attaching decoders to the trained encoders, and then fine-tune the entire model for 50k iterations with a learning rate of 0.001, RMSprop optimization (momentum of 0.9) and a weight decay of 0.001. We apply random scaling, random shifting, random rotation, and random horizontal flipping for data augmentation. We use global average accuracy (GAA), mean accuracy (MA), and mean intersection over union (mIoU) as the evaluation metrics for this segmentation task (Task III).
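For reference, a minimal sketch of the mean-IoU metric for the three-class segmentation task is given below; the paper's exact evaluation code may differ.

```python
import numpy as np

def mean_iou(pred, target, num_classes=3):
    """Mean intersection-over-union for the three-class CRL/NT segmentation task.

    pred, target: integer label maps of identical shape.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```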

Fig. 4. Illustration of the qualitative results for the first-trimester segmentation task.

Results and Discussion. For Task III, we find that AWCL (fine-grained) achieves comparable or slightly better performance than supervised pre-training with fine-grained labels and surpasses the other compared pre-training methods by large margins in mIoU (see Table 2). In particular, it outperforms ImageNet and SimCLR by 13.8% and 7.1% in mIoU, respectively. Likewise, AWCL (coarse) performs better than ImageNet, supervised pre-training with coarse-grained labels, SimCLR and CLPI by large margins on most evaluation metrics. Figure 4 also visualizes the superior performance of AWCL (fine-grained) and AWCL (coarse) compared to SimCLR and CLPI, which aligns with the quantitative evaluation. These observations suggest that the AWCL approaches learn more meaningful semantic representations that are beneficial for this pixel-wise segmentation task. Overall, the results on Tasks II and III show that the AWCL models report consistently better performance than the compared pre-trained models, implying the advantage of learning task-agnostic features that generalize better to tasks from different domains.

6 Analysis

6.1 Partial Fine-Tuning

To analyze representation quality, we extract fixed feature representations from the last layer of the ResNet-18 encoder and then evaluate them on two classification target tasks (Task I and Task II). Experimentally, we freeze the entire backbone encoder and attach a classification head [4] for Task I, and the non-linear classifier described in Sect. 5.2 for Task II. From Table 3, we observe that the AWCL approaches show better representation quality, surpassing the three compared approaches in F1-score for both tasks. This suggests that the learned representations are strong non-linear features that are more generalizable and transferable to the downstream tasks. Comparing Tables 2 and 3, we find that although the scores of partial fine-tuning are generally lower than those of full fine-tuning, the performance of the two transfer learning implementations is correlated.
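A minimal sketch of the partial fine-tuning setup, assuming the backbone is frozen and only the attached head receives gradient updates:

```python
import torch.nn as nn

def freeze_encoder(encoder: nn.Module) -> nn.Module:
    """Freeze the backbone so only the attached classification head is trained."""
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()                 # also fix batch-norm statistics
    return encoder

# During partial fine-tuning, only the head parameters are optimized, e.g.:
# optimizer = torch.optim.SGD(head.parameters(), lr=..., momentum=0.9)
```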

Table 3. Performance comparison of partial fine-tuning (mean ± std. [%]) on the tasks of standard plane detection (Task I) and first-trimester anatomy recognition (Task II). Best results are marked in bold.

6.2 Visualization of Feature Representations

In this section we investigate why the feature representations produced by the AWCL pre-trained models result in better downstream task performance. We visualize the image representations of Task II extracted from the penultimate layers using t-SNE [20] in Fig. 5, where different anatomical categories are denoted by different colors. We compare the resulting t-SNE embeddings of the AWCL models with those of SimCLR and CLPI. We observe that the feature representation by CLPI is not well separable, especially for the NT and CRL classes. The feature embeddings from SimCLR are generally better separated than those of CLPI, although confusion between CRL and Bk remains. By comparison, AWCL (fine-grained) achieves the best-separated clusters among the five anatomical categories, which means that the learned representations in the embedding space are more distinguishable. These visualization results demonstrate that the AWCL approaches learn discriminative feature representations that generalize better to downstream tasks.
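The visualization can be reproduced along the following lines; `features` and `labels` (penultimate-layer activations and anatomy categories on Task II) are assumed to be precomputed, and the t-SNE hyper-parameters shown are illustrative defaults rather than the settings used in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray):
    # Project penultimate-layer features to 2-D and color points by anatomy class.
    emb = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='tab10', s=5)
    plt.axis('off')
    plt.show()
```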

Fig. 5. t-SNE feature visualization of the model penultimate layers on Task II.

6.3 Impact of Data Granularity on AWCL

We analyze how the inclusion of coarse- and fine-grained anatomy information impacts the AWCL framework by comparing the experimental results of AWCL (coarse) and AWCL (fine-grained) from Sect. 5.1 to Sect. 6.2. Based on the transfer learning results in Table 2, we find that AWCL (fine-grained) achieves better performance than AWCL (coarse) on Tasks II and III, despite a slight performance drop on Task I. We hypothesize that AWCL (coarse) learns more generic representations than AWCL (fine-grained), which leads to better in-domain generalization. Qualitative results in Fig. 3 and Fig. 4 also reveal the advantage of AWCL (fine-grained) over its counterpart. In the ablation analysis, Table 3 shows a similar finding to Table 2, and Fig. 5 shows that the feature embeddings of AWCL (fine-grained) are more discriminative than those of AWCL (coarse), thereby resulting in better generalization to downstream tasks. These observations suggest the importance of learning intra-class feature representations for better generalization to downstream tasks, especially when there is a domain shift.

Fig. 6. Impact of anatomy ratio on AWCL (fine-grained) evaluated on Task II.

6.4 Impact of Anatomy Ratio on AWCL

We investigate how varying anatomy ratios impact the AWCL framework. Note that a higher anatomy ratio means that a larger number of samples from the same or different US scans belonging to the same anatomy category are included to form positive pairs for contrastive learning. We incorporated the anatomy information at four different ratios, 10%, 30%, 50%, and 80%, to train AWCL (fine-grained) models on the pre-training dataset, and then evaluated these trained models on Task II via full fine-tuning. As shown in Fig. 6, the performance improves with an increasing anatomy ratio, suggesting that using more distinct but anatomically similar samples to compose positive pairs results in better-quality representations.
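One way to realize this ablation is to drop the anatomy labels of a random subset of frames before pre-training, as sketched below; the -1 convention for unlabeled frames and the exact sampling strategy are assumptions, not the authors' protocol.

```python
import numpy as np

def subsample_anatomy_labels(anatomy: np.ndarray, ratio: float, seed: int = 0):
    """Keep anatomy labels for only a fraction `ratio` of the labeled frames.

    Frames whose labels are dropped fall back to the vanilla objective via Eq. 3.
    """
    rng = np.random.default_rng(seed)
    labeled = np.flatnonzero(anatomy >= 0)
    drop = rng.choice(labeled, size=int((1 - ratio) * len(labeled)), replace=False)
    anatomy = anatomy.copy()
    anatomy[drop] = -1             # -1 marks "no anatomy information" (assumed convention)
    return anatomy
```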

7 Conclusion

In this paper, we presented a new anatomy-aware contrastive learning (AWCL) approach for fetal ultrasound imaging tasks. The proposed approach leverages more positive samples with the same anatomy category from the same or different US videos, and thereby aligns well with the anatomical characteristics of ultrasound videos. The feature representation analysis shows that the AWCL approaches learn discriminative representations that generalize better to downstream tasks. In the reported comparative study, AWCL with fine-grained anatomy information, which preserves intra-class differences, was more effective than its coarse-grained counterpart. Experimental evaluations demonstrate that our AWCL approach provides useful transferable representations for various downstream clinical tasks, especially for cross-domain generalization. The proposed approach can potentially be applied to other medical imaging modalities where such anatomy information is available.