
1 Introduction

Semi-supervised and self-supervised representation learning, which require few or no annotations, have attracted significant attention across various medical imaging modalities [7, 10, 26,27,28]. These learning schemes can exploit large-scale unlabeled medical datasets and learn meaningful representations for downstream task fine-tuning. In particular, contrastive representation learning based on instance discrimination tasks [6, 11] has become the leading paradigm for self-supervised pre-training, where a model is trained to pull each instance and its augmented views together in the embedding space while pushing it away from all other instances.

Fig. 1. Illustration of different representation learning approaches for fetal ultrasound: (a) self-supervised contrastive learning, (b) contrastive learning with patient metadata, and (c) our proposed anatomy-aware contrastive learning. Circle, square and triangle icons denote the anatomical categories of fetal head, profile, and abdomen, respectively. The anchor image is highlighted with a red bounding box, and the red dotted circle indicates "pull together" (best viewed in color).

However, directly applying self-supervised contrastive learning (e.g. SimCLR [6] and MoCo [11]) in the context of medical imaging may result in visual representations that are inconsistent in appearance and semantics. We illustrate this issue in Fig. 1(a), which shows that a vanilla contrastive learning approach without considering the domain-specific anatomical characteristics leads to false negatives, i.e. some negative samples having high affinity with the anchor image are “pushed away”. To address this, we explore the following question: Is domain-specific anatomy information helpful in learning better representations for medical data?

We investigate this question via the proposed anatomy-aware contrastive learning (AWCL), as depicted in Fig. 1(c), where "anatomy-aware" means that anatomy information is leveraged to augment positive/negative pair sampling during contrastive learning. In this work, we demonstrate the proposed approach on fetal ultrasound imaging tasks, where a number of different fetal anatomies can be present in a diagnostic scan. Motivated by Khosla et al. [18], we expand the pool of positive samples by grouping images from the same or different ultrasound scans that share common anatomical categories. More importantly, our approach is optimized alternately with both the conventional and the anatomy-aware contrastive objectives, as shown in Fig. 2(a), given that anatomy information is not always accessible for each sampling process. Moreover, we consider both coarse- and fine-grained anatomical categories for data sampling, depending on their availability, as shown in Fig. 2(b) and (c), and empirically investigate their effect on the transferability of the learned feature representations. To assess the effectiveness of our pre-trained representations, we evaluate transfer learning on three downstream clinical tasks: standard plane detection, segmentation of Crown Rump Length (CRL) and Nuchal Translucency (NT), and recognition of first-trimester anatomical structures. In summary, the main contributions and findings are:

  • We develop an anatomy-aware contrastive learning approach for medical fetal ultrasound imaging tasks.

  • We empirically compare the effect of inclusion of anatomy information with coarse- and fine-grained granularity respectively, within our contrastive learning approach. The comparative analysis suggests that contrastive learning with fine-grained anatomy information which preserves intra-class difference is more effective than its counterpart.

  • Experimental evaluations on three downstream clinical tasks demonstrate the better generalizability of our proposed approaches over learning from an ImageNet pre-trained ResNet, vanilla contrastive learning [6], and contrastive learning with patient information [2, 7, 25].

  • We provide an in-depth analysis to show the proposed approach learns high-quality discriminative representations.

2 Related Work

Self-supervised Learning (SSL) in Medical Imaging. Prior works using SSL for medical imaging typically focus on designing pretext tasks, such as solving a Rubik's cube [28], image restoration [14, 27], predicting anatomical position [3] and multi-task joint reasoning [17]. Recently, contrastive SSL [6, 11] has been successfully applied to learn more discriminative representations across various medical imaging tasks [7, 24, 26]. In particular, Sowrirajan et al. [24] adapted a MoCo-style contrastive learning method [11] to chest X-rays and demonstrated better transferable representations and initialization for chest X-ray diagnostic tasks. Taher et al. [13] presented a benchmark evaluation study investigating the effectiveness of several established contrastive learning models pre-trained on ImageNet for a variety of medical imaging tasks. In addition, recent approaches [2, 7, 25] leverage patient metadata to improve contrastive learning for medical imaging. These approaches constrain the selection of positive pairs to the same subject (video), with the assumption that visual representations from the same subject share similar semantic meaning. However, they may not generalize well to a scenario where different organs or anatomical structures are captured in a single video. For instance, as seen in Fig. 1(b), some positive pairs having low affinity in visual appearance and semantics are pulled together, i.e. false positives, which can degrade the representation learning. The proposed learning scheme, as shown in Fig. 1(c), addresses the aforementioned limitations by augmenting the sampling process with anatomy information. Moreover, our approach differs from [26] and [7], which combine label information with self-supervision as an additional supervision signal for multi-tasking.

Representation Learning in Fetal Ultrasound. There are related works exploring representation learning for fetal ultrasound imaging tasks. Baumgartner et al. [4] and Schlemper et al. [22] proposed a VGG-based network and an attention-gated network respectively to detect fetal standard planes. Sharma et al. [23] presented a multi-stream network which combines 2D image and spatio-temporal information to automate clinical workflow description of full-length routine fetal anomaly ultrasound scans. Cai et al. [5] considered incorporating the temporal dimension into visual attention modelling via multi-task learning for standard biometry plane-finding navigation. However, the generalization and transferability of those models to other target tasks remains unclear. Droste et al. [8] proposed to learn transferable representations for fetal ultrasound interpretation by modelling sonographer visual attention (gaze tracking) without manual supervision. More recently, Jiao et al. [16] proposed to derive a meaningful representation from raw data by developing a cross-modal contrastive learning which aligns the correspondence between fetal ultrasound video and narrative speech audio. Our work differs by focusing on learning general image representations without requiring additional data modalities (e.g. gaze tracking and audio) from the domain of interest, and we also perform extensive experimental analysis on three downstream clinical tasks to assess the effectiveness of the learned representations.

3 Fetal Ultrasound Imaging Dataset

This study uses a large-scale fetal ultrasound imaging dataset, which was acquired as part of PULSE (Perception Ultrasound by Learning Sonographic Experience) project [9]. The scans were performed by operators including sonographers and fetal medicine specialists using a commercial Voluson E8 version BT18 (General Electric, Zipf, Austria) ultrasound machine. During a routine scanning session, the operator views several fetal or maternal anatomical structures. The frozen views saved by sonographers are referred to as standard planes in the paper, following the UK Fetal Anomaly Screening Programme (FASP) nomenclature [1].

Fetal ultrasound videos were recorded from the ultrasound scanner display using lossless compression and sampled at a rate of 30 Hz. We consider a subset of the entire ultrasound dataset for the proposed pre-training approach, consisting of 2,050,432 frames in total from 534 second-trimester ultrasound videos. In this sub-dataset, 15,384 frames are labeled with 13 fine-grained anatomy categories, including four views of the heart, namely three-vessel and trachea (3VT), four-chamber (4CH), right ventricular outflow tract (RVOT) and left ventricular outflow tract (LVOT); two views of the brain, transventricular (BrainTv.) and transcerebellum (BrainTc.); two views of the spine, coronal (SpineCor.) and sagittal (SpineSag.); abdomen; femur; kidneys; lips; profile; and a background class. In addition, 69,671 frames are labeled with coarse anatomy categories, which do not divide the heart, brain and spine into the sub-categories above, and additionally cover 3D mode, maternal anatomy including Doppler, abdomen, nose and lips, kidneys, face-side profile, full-body-side profile, bladder including Doppler, femur and an "Other" class. All image frames were preprocessed by cropping the ultrasound image region and resizing it to \(224\times 288\) pixels.

Fig. 2. (a) Overview of the proposed anatomy-aware contrastive learning approach. (b) and (c) illustrate the use of coarse- and fine-grained anatomy categories, respectively, within the proposed AWCL framework. White-circle, grey-circle, square and triangle icons denote the classes of coronal view of spine, sagittal view of spine, profile, and abdomen, respectively.

4 Method

In this section, we first describe the problem formulation of contrastive learning with medical images, and then present our anatomy-aware contrastive learning algorithm design as well as training details.

4.1 Problem Formulation

For each input image \(\textbf{x}\) in a mini-batch of N samples, randomly sampled from a pre-training dataset \(\mathcal {V}\), a contrastive learning framework (e.g. SimCLR [6]) applies two augmentations to obtain a positive pair \((\tilde{\textbf{x}}_{i}, \tilde{\textbf{x}}_{j})\), yielding a set of 2N samples. Letting i denote the index of the anchor, the contrastive learning objective is defined as

$$\begin{aligned} L_{C}^{i}=-\log \frac{\exp \left( {\text {sim}}\left( \textbf{z}_{i}, \textbf{z}_{j}\right) / \tau \right) }{\sum _{k=1}^{2N} \textbf{1}_{[k \ne i]} \exp \left( {\text {sim}}\left( \textbf{z}_{i}, \textbf{z}_{k}\right) / \tau \right) }, \end{aligned}$$
(1)

where \(\textbf{1}_{[k \ne i]}\in \{0,1\}\) is an indicator function equal to 1 iff \(k \ne i\), \(\tau \) is a temperature parameter and \({\text {sim}}(\cdot )\) is the pairwise cosine similarity. \(\textbf{z}\) is a representation vector, computed as \(\textbf{z}= g(f(\textbf{x}))\), where \(f(\cdot )\) denotes a shared encoder modelled by a convolutional neural network (CNN) and \(g(\cdot )\) is a multi-layer perceptron (MLP) projection head.
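For concreteness, the following is a minimal PyTorch-style sketch of the objective in Eq. 1. The function name `nt_xent_loss` and the convention that the two views of the same image sit at rows i and i+N of the batch are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Vanilla contrastive loss of Eq. 1.

    z: (2N, d) projected embeddings; rows i and i+N are assumed to hold the
    two augmented views of the same image (an illustrative convention).
    """
    z = F.normalize(z, dim=1)                        # cosine similarity via dot products
    sim = torch.matmul(z, z.t()) / tau               # (2N, 2N) similarity matrix
    n2 = z.size(0)
    self_mask = torch.eye(n2, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))  # the 1_{[k != i]} indicator
    log_prob = F.log_softmax(sim, dim=1)             # softmax over all k != i
    idx = torch.arange(n2, device=z.device)
    pos_idx = torch.cat([idx[n2 // 2:], idx[:n2 // 2]])  # i <-> i+N pairing
    return -log_prob[idx, pos_idx].mean()
```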

The above underpins vanilla contrastive learning. However, in some cases (e.g. the ultrasound scans considered in this paper), this standard approach, as well as its extension that leverages patient information [2, 7, 25], may lead to false negatives and false positives respectively, as seen in Fig. 1(a) and (b). To address this, we introduce a new approach, detailed next.

4.2 Anatomy-Aware Contrastive Learning

Figure 1(c) illustrates the main idea of the new anatomy-aware contrastive learning (AWCL) approach, which incorporates additional samples belonging to the same anatomy category from the same or different US scans. In addition to positive sampling from the same image and its augmentation, AWCL is tailored to the case where multiple anatomical structures are present.

As shown in Fig. 2(a), we utilize the available anatomy information as detailed in Sect. 3, forming a positive sample set \(\mathcal {A}(i)\) with the same anatomy as sample i. The assumption for such a design is that image samples within the same anatomy category should have similar appearances, based on a clinical perspective [9]. Motivated by [18], we design the anatomy-aware contrastive learning objective as follows,

$$\begin{aligned} L_{A}^{i}=-\frac{1}{|\mathcal {A}(i)|}\sum _{a\in \mathcal {A}(i)}\log \frac{\exp \left( {\text {sim}}\left( \textbf{z}_{i}, \textbf{z}_{a}\right) / \tau \right) }{\sum _{k=1}^{2N} \textbf{1}_{[k \ne i]} \exp \left( {\text {sim}}\left( \textbf{z}_{i}, \textbf{z}_{k}\right) / \tau \right) }, \end{aligned}$$
(2)

where \(|\mathcal {A}(i)|\) denotes the cardinality of the positive set \(\mathcal {A}(i)\).

Due to the limited availability of some anatomical categories, \(\mathcal {A}(i)\) is not always available for each sampling process. The AWCL framework is therefore formulated as an alternating optimization that combines the learning objectives of Eq. 1 and Eq. 2, giving the loss function

$$\begin{aligned} L^{i}= {\left\{ \begin{array}{ll} L_{C}^{i}&{} \text{ if } |\mathcal {A}(i)| = 0 \\ L_{A}^{i}&{} \text{ if } |\mathcal {A}(i)| > 0. \end{array}\right. } \end{aligned}$$
(3)

Furthermore, we consider both coarse- and fine-grained anatomical categories for the proposed AWCL framework and compare their effect on the transferability of visual representations. Figure 2(b) and (c) illustrate the motivation for this comparative analysis. For an anatomical structure with several views of distinct visual appearance (e.g. the spine, which has two views as sub-classes), AWCL with coarse-grained anatomy information tends to minimize the intra-class difference by pulling together all instances of the same anatomy. In contrast, AWCL with fine-grained anatomy information tends to preserve the intra-class difference by pushing away images with different visual appearances despite sharing the same anatomy. Both strategies of the proposed learning approach are evaluated and compared in Sect. 6.3. We further study the impact of the ratio of anatomy information used in AWCL pre-training in Sect. 6.4.

Algorithm 1. Pseudo-code of the proposed AWCL pre-training procedure.
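The pseudo-code itself is not reproduced here; as a rough substitute, below is a minimal PyTorch-style sketch of the AWCL objective (Eq. 2) with the per-anchor fallback to the vanilla objective (Eq. 3). The batch layout, the use of -1 to mark frames without an anatomy label, and the inclusion of the augmented view in the positive set are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def awcl_loss(z: torch.Tensor, anatomy: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Anatomy-aware contrastive loss (Eq. 2) with the fallback of Eq. 3.

    z:       (2N, d) projected embeddings; rows i and i+N are the two views.
    anatomy: (2N,) integer anatomy labels; -1 marks frames without a label
             (an assumed convention).
    """
    z = F.normalize(z, dim=1)
    sim = torch.matmul(z, z.t()) / tau
    n2 = z.size(0)
    self_mask = torch.eye(n2, dtype=torch.bool, device=z.device)
    log_prob = F.log_softmax(sim.masked_fill(self_mask, float('-inf')), dim=1)

    idx = torch.arange(n2, device=z.device)
    aug_pos = torch.cat([idx[n2 // 2:], idx[:n2 // 2]])   # augmented-view partner

    losses = []
    for i in range(n2):
        pos = torch.zeros(n2, dtype=torch.bool, device=z.device)
        if anatomy[i] >= 0:                 # A(i): same anatomy category, any scan
            pos = (anatomy == anatomy[i]) & ~self_mask[i]
        pos[aug_pos[i]] = True              # the augmented view is always a positive
        losses.append(-log_prob[i, pos].mean())   # average over positives (Eq. 2)
    return torch.stack(losses).mean()
```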

4.3 Implementation Details

Algorithm 1 provides the pseudo-code of AWCL. Following prior art [7, 24, 25], we use ResNet-18 [12] as our backbone architecture; further studies on different network architectures are outside the scope of this paper. We split the pre-training dataset detailed in Sect. 3 into training and validation sets (80%/20%), and train the model using the Adam optimizer with a weight decay of \(10^{-6}\) and a mini-batch size of 32. We follow [6] for the data augmentations applied to the sampled training data. The output feature dimension of z is set to 128 and the temperature parameter \(\tau \) is set to 0.5. The models are trained with the loss functions defined earlier (Eq. 1 and Eq. 2, combined as in Eq. 3) for 10 epochs with a learning rate of \(10^{-3}\). The whole framework is implemented in PyTorch [21] on a PC with an NVIDIA Titan V GPU. The code is available at https://github.com/JianboJiao/AWCL.
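A hypothetical sketch of the corresponding pre-training setup is given below; the augmentation pipeline is assumed to follow SimCLR [6], and the MLP head width is an assumption, since only the output dimension (128) is specified in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Assumed SimCLR-style augmentation pipeline for 224x288 ultrasound frames.
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop((224, 288)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

encoder = models.resnet18(weights=None)          # backbone f(.)
encoder.fc = nn.Identity()                       # expose 512-d features
projector = nn.Sequential(                       # MLP head g(.), 128-d output
    nn.Linear(512, 512), nn.ReLU(inplace=True), nn.Linear(512, 128))

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(projector.parameters()),
    lr=1e-3, weight_decay=1e-6)
# mini-batch size 32, temperature 0.5 and 10 training epochs, as in Sect. 4.3
```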

To demonstrate the effectiveness of the AWCL-trained models, we compare them with random initialization, an ImageNet pre-trained ResNet-18 [12], supervised pre-training with coarse labels, supervised pre-training with fine-grained labels, vanilla contrastive learning (SimCLR) [6], and contrastive learning with patient information (CLPI) [2, 7, 19]. All pre-training methods presented here are pre-trained from scratch on the pre-training dataset with similar parameter configurations to those listed above.

5 Experiments on Transfer Learning

In this section, we evaluate the effectiveness of the SSL pre-trained representations by supervised transfer learning with end-to-end fine-tuning on three downstream clinical tasks: second-trimester standard plane detection (Task I), recognition of first-trimester anatomies (Task II) and segmentation of NT and CRL (Task III). The datasets for downstream task evaluation are listed in Table 1 and are independent datasets from [9]. For fair comparison, all compared pre-training models were fine-tuned with the same parameter settings and data augmentation policies within each downstream task evaluation.

5.1 Evaluation on Standard Plane Detection

Evaluation Details. Here, we investigate how the pre-trained representations generalize to an in-domain second-trimester classification task, which consists of the same fine-grained anatomical categories as detailed in Sect. 3. We attach a classifier head [4] to each pre-trained backbone encoder and fine-tune the entire network for 70 epochs with a learning rate of 0.01, decayed by 0.1 at epochs 30 and 55. The network is trained via SGD with momentum of 0.9, weight decay of \(5\times 10^{-4}\), a mini-batch size of 16 and a cross-entropy loss, and is evaluated via three-fold cross-validation. The augmentation policy is analogous to [8], including random horizontal flipping, rotation (10\(^\circ \)), and varying gamma and brightness. We employ precision, recall and F1-score, computed as macro-averages, as the evaluation metrics.
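The macro-averaged metrics can be computed, for instance, with scikit-learn; this is an assumed tooling choice for illustration, not the authors' evaluation code.

```python
from sklearn.metrics import precision_recall_fscore_support

def macro_metrics(y_true, y_pred):
    # Macro-averaged precision, recall and F1 over the standard-plane classes.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0)
    return precision, recall, f1
```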

Table 1. Details of the downstream datasets and imaging tasks.
Table 2. Quantitative comparison of fine-tuning performance (mean ± std. [%]) on the tasks of standard plane detection (Task I), first-trimester anatomy recognition (Task II) and CRL / NT segmentation (Task III). Best results are marked in bold.
Fig. 3. Illustration of the confusion matrix for the first-trimester classification task.

Results and Discussion. Table 2 shows a quantitative comparison of fine-tuning performance for the three evaluated downstream tasks. From the results of Task I, we observe that AWCL pre-trained models, i.e. AWCL (coarse) and AWCL (fine-grained), generally outperform the compared contrastive learning methods SimCLR and CLPI. In particular, AWCL (coarse) improves on SimCLR and CLPI by 1.9% and 3.8% in F1-score, respectively. Compared to the supervised pre-training methods, both AWCL approaches achieve better performance in Recall and F1-score than vanilla supervised pre-training with coarse-grained labels. These findings suggest that incorporating anatomy information to select positive pairs from multiple scans can notably improve representation learning.

However, we find that all the contrastive pre-training approaches presented here underperform supervised pre-training (fine-grained), which has the same form of semantic supervision as Task I. This suggests that without explicitly encoding semantic information, contrastively learned representations may provide limited benefit for generalization to a fine-grained multi-class classification task, which is in line with the findings in [15].

5.2 Evaluation on Recognition of First-Trimester Anatomies

Evaluation Details. We investigate how the pre-trained representations generalize to a cross-domain classification task using the first-trimester US scans. This first-trimester classification task recognises five anatomical categories: crown-rump length (CRL), nuchal translucency (NT), biparietal diameter (BPD), 3D and background (Bk). We split the data into training and testing sets (78%/22%). The trained encoders, followed by two fully-connected layers and a softmax layer, were fine-tuned for 200 epochs with a learning rate of 0.1 decayed by 0.1 at epoch 150. The network was trained using SGD with momentum of 0.9. Standard data augmentation was used, including rotation \([-30^{\circ }, 30^{\circ }]\), horizontal flip, Gaussian noise, and shear \({\le }0.2\). Batch size was adjusted according to model size and GPU memory restrictions. We use the same metrics as in Task I for performance evaluation.
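A minimal sketch of the Task II fine-tuning architecture described above is shown below; the hidden width of the two fully-connected layers is an assumption, as the paper does not specify it, and the softmax is folded into the cross-entropy loss.

```python
import torch.nn as nn

class FirstTrimesterClassifier(nn.Module):
    """Pre-trained encoder followed by two fully-connected layers (Task II)."""

    def __init__(self, encoder: nn.Module, feat_dim: int = 512, num_classes: int = 5):
        super().__init__()
        self.encoder = encoder                     # pre-trained ResNet-18 backbone
        self.head = nn.Sequential(                 # hidden width 256 is an assumption
            nn.Linear(feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes))           # logits; softmax applied in the loss

    def forward(self, x):
        return self.head(self.encoder(x))
```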

Results and Discussion. For Task II, we see from Table 2 that AWCL (fine-grained) achieves the best performance among all the compared solutions. In particular, it achieves a performance gain of 4.9%, 3.4% and 5.0% in Precision, Recall and F1-score compared to ImageNet pre-training, and even improves on supervised pre-training with fine-grained labels (the upper-bound baseline) by 0.7% in F1-score. Moreover, AWCL (coarse) also surpasses ImageNet and supervised pre-training with coarse-grained labels by 1.9% and 6.3% in F1-score. Comparing with the other contrastive learning methods, we observe a similar trend to that described for Task I, i.e. AWCL (coarse) and AWCL (fine-grained) perform better than SimCLR and CLPI. Further evidence is provided in Fig. 3, which shows that both AWCL (coarse) and AWCL (fine-grained) provide better prediction accuracy than CLPI for all anatomy categories. These experimental results again demonstrate the effectiveness of the AWCL approaches and suggest that including anatomy information in contrastive learning is good practice whenever it is available.

5.3 Evaluation on Segmentation of NT and CRL

Evaluation Details. In this section, we evaluate how the pre-trained models generalize to a cross-domain segmentation task with data from the first-trimester US scans. Segmentation of NT and CRL is defined as a three-class segmentation task, the classes being mid-sagittal view, nuchal translucency and background. The data is divided into training and testing sets (80%/20%). We follow the design of a ResNet-18 auto-encoder by attaching decoders to the trained encoders, and then fine-tune the entire model for 50k iterations with a learning rate of 0.001, RMSprop optimization (momentum of 0.9) and a weight decay of 0.001. We apply random scaling, random shifting, random rotation, and random horizontal flipping for data augmentation. We use global average accuracy (GAA), mean accuracy (MA), and mean intersection over union (mIoU) as the evaluation metrics for this segmentation task (Task III).
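For reference, a minimal sketch of the mean-IoU metric for the three-class segmentation task is given below; the paper's exact evaluation code may differ.

```python
import numpy as np

def mean_iou(pred, target, num_classes=3):
    """Mean intersection-over-union for the three-class CRL/NT segmentation task.

    pred, target: integer label maps of identical shape.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```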

Fig. 4. Illustration of the qualitative results for the first-trimester segmentation task.

Results and Discussion. For Task III, we find that AWCL (fine-grained) achieves comparable or slightly better performance than supervised pre-training with fine-grained labels and surpasses the other compared pre-training methods by large margins in mIoU (see Table 2). In particular, it outperforms ImageNet and SimCLR by 13.8% and 7.1% in mIoU, respectively. Likewise, AWCL (coarse) performs better than ImageNet, supervised pre-training with coarse-grained labels, SimCLR and CLPI by large margins on most evaluation metrics. Figure 4 also visualizes the superior performance of AWCL (fine-grained) and AWCL (coarse) compared to SimCLR and CLPI, which aligns with the quantitative evaluation. These observations suggest that the AWCL approaches learn more meaningful semantic representations that are beneficial for this pixel-wise segmentation task. Overall, the results on Tasks II and III show that the AWCL models report consistently better performance than the compared pre-trained models, implying the advantage of learning task-agnostic features that generalize better to tasks from different domains.

6 Analysis

6.1 Partial Fine-Tuning

To analyze representation quality, we extract fixed feature representations from the last layer of the ResNet-18 encoder and then evaluate them on two classification target tasks (Task I and Task II). Experimentally, we freeze the entire backbone encoder and attach a classification head [4] for Task I, and the non-linear classifier described in Sect. 5.2 for Task II. From Table 3, we observe that the AWCL approaches show better representation quality, surpassing the three compared approaches in F1-score for both tasks. This suggests that the learned representations are strong non-linear features that are more generalizable and transferable to the downstream tasks. Comparing Tables 2 and 3, we find that although the scores of partial fine-tuning are generally lower than those of full fine-tuning, the performance of the two transfer learning implementations is correlated.
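A minimal sketch of the partial fine-tuning setup, assuming the backbone is frozen and only the attached head receives gradient updates:

```python
import torch.nn as nn

def freeze_encoder(encoder: nn.Module) -> nn.Module:
    """Freeze the backbone so only the attached classification head is trained."""
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()                 # also fix batch-norm statistics
    return encoder

# During partial fine-tuning, only the head parameters are optimized, e.g.:
# optimizer = torch.optim.SGD(head.parameters(), lr=..., momentum=0.9)
```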

Table 3. Performance comparison of partial fine-tuning (mean ± std. [%]) on the tasks of standard plane detection (Task I) and first-trimester anatomy recognition (Task II). Best results are marked in bold.

6.2 Visualization of Feature Representations

In this section we investigate why the feature representations produced by the AWCL pre-trained models result in better downstream task performance. We visualize the image representations of Task II extracted from the penultimate layers using t-SNE [20] in Fig. 5, where different anatomical categories are denoted by different colors. We compare the resulting t-SNE embeddings of the AWCL models with those of SimCLR and CLPI. We observe that the feature representation by CLPI is not well separable, especially for the NT and CRL classes. The feature embeddings from SimCLR are generally better separated than those of CLPI, although confusion between CRL and Bk remains. By comparison, AWCL (fine-grained) achieves the best-separated clusters among the five anatomical categories, which means that the learned representations in the embedding space are more distinguishable. These visualization results demonstrate that the AWCL approaches learn discriminative feature representations that generalize better to downstream tasks.
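The visualization can be reproduced along the following lines; `features` and `labels` (penultimate-layer activations and anatomy categories on Task II) are assumed to be precomputed, and the t-SNE hyper-parameters shown are illustrative defaults rather than the settings used in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray):
    # Project penultimate-layer features to 2-D and color points by anatomy class.
    emb = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='tab10', s=5)
    plt.axis('off')
    plt.show()
```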

Fig. 5. t-SNE feature visualization of the model penultimate layers on Task II.

6.3 Impact of Data Granularity on AWCL

We analyze how the inclusion of coarse- and fine-grained anatomy information impacts the AWCL framework by comparing the experimental results of AWCL (coarse) and AWCL (fine-grained) from Sect. 5.1 to Sect. 6.2. Based on the transfer learning results in Table 2, we find that AWCL (fine-grained) achieves better performance than AWCL (coarse) on Tasks II and III, despite a slight performance drop on Task I. We hypothesize that AWCL (coarse) learns more generic representations than AWCL (fine-grained), which leads to better in-domain generalization. Qualitative results in Fig. 3 and Fig. 4 also reveal the advantage of AWCL (fine-grained) over its counterpart. In the ablation analysis, Table 3 shows a similar finding to Table 2, and Fig. 5 shows that the feature embeddings of AWCL (fine-grained) are more discriminative than those of AWCL (coarse), thereby resulting in better generalization to downstream tasks. These observations suggest the importance of learning intra-class feature representations for better generalization to downstream tasks, especially when there is a domain shift.

Fig. 6. Impact of anatomy ratio on AWCL (fine-grained) evaluated on Task II.

6.4 Impact of Anatomy Ratio on AWCL

We investigate how varying anatomy ratios impact the AWCL framework. Note that a higher anatomy ratio means that a larger number of samples from the same or different US scans belonging to the same anatomy category are included to form positive pairs for contrastive learning. We incorporated the anatomy information at four different ratios, 10%, 30%, 50%, and 80%, to train AWCL (fine-grained) models on the pre-training dataset, and then evaluated these trained models on Task II via full fine-tuning. As shown in Fig. 6, the performance improves with an increasing anatomy ratio, suggesting that using more distinct but anatomically similar samples to compose positive pairs results in better-quality representations.
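One way to realize this ablation is to drop the anatomy labels of a random subset of frames before pre-training, as sketched below; the -1 convention for unlabeled frames and the exact sampling strategy are assumptions, not the authors' protocol.

```python
import numpy as np

def subsample_anatomy_labels(anatomy: np.ndarray, ratio: float, seed: int = 0):
    """Keep anatomy labels for only a fraction `ratio` of the labeled frames.

    Frames whose labels are dropped fall back to the vanilla objective via Eq. 3.
    """
    rng = np.random.default_rng(seed)
    labeled = np.flatnonzero(anatomy >= 0)
    drop = rng.choice(labeled, size=int((1 - ratio) * len(labeled)), replace=False)
    anatomy = anatomy.copy()
    anatomy[drop] = -1             # -1 marks "no anatomy information" (assumed convention)
    return anatomy
```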

7 Conclusion

In this paper, we presented a new anatomy-aware contrastive learning (AWCL) approach for fetal ultrasound imaging tasks. The proposed approach leverages more positive samples with the same anatomy category from the same or different US videos, and thereby aligns well with the anatomical characteristics of ultrasound videos. The feature representation analysis shows that the AWCL approaches learn discriminative representations that generalize better to downstream tasks. In the reported comparative study, AWCL with fine-grained anatomy information, which preserves intra-class differences, was more effective than its coarse-grained counterpart. Experimental evaluations demonstrate that our AWCL approach provides useful transferable representations for various downstream clinical tasks, especially for cross-domain generalization. The proposed approach can potentially be applied to other medical imaging modalities where such anatomy information is available.