Abstract
Contrastive learning has proven effective for pre-training image models on unlabeled data, with promising results for tasks such as medical image classification. Using paired text (like radiological reports) during pre-training improves the results even further. Still, most existing methods target image classification as the downstream task and may not be optimal for localized tasks like semantic segmentation or object detection. We therefore propose Localized representation learning from Vision and Text (LoVT), a text-supervised pre-training method that explicitly targets localized medical imaging tasks. Our method combines instance-level image-report contrastive learning with local contrastive learning on image region and report sentence representations. We evaluate LoVT and commonly used pre-training methods on an evaluation framework of 18 localized tasks on chest X-rays from five public datasets. LoVT performs best on 10 of the 18 studied tasks, making it the method of choice for localized tasks.
1 Introduction and Motivation
In medical applications of computer vision, high-quality annotated data is scarce and expensive to acquire, as it typically requires trained physicians to manually label samples [37]. Therefore, the requirement for large labeled datasets can become quite problematic and may limit the applications of deep learning in this field. One approach to overcome this problem is to utilize radiological reports that are paired with medical images. Such reports are produced routinely in clinical practice and are typically written by medical experts (e.g. radiologists). They thus provide a valuable source of semantic information that is available with little additional cost. Rule-based Natural Language Processing (NLP) models like CheXpert [19] extract labels from these reports allowing the automatic creation of large datasets but they also have some significant limitations. Most importantly, such approaches are typically limited to classification tasks. They generate overall labels for reports (and therefore the paired images) but relating these labels to specific image regions is nontrivial so they cannot be used for localized tasks like semantic segmentation or object detection. Also, rule-based NLP models have to be manually created and cannot generalize to different classification tasks or even different report writing styles [19]. Instead of using these reports to generate classification labels, the reports can be utilized directly in the pre-training method, as was first proposed in the ConVIRT method [51]. Here, the semantic information contained in the reports is used as weak supervision to pre-train image models that are then fine-tuned on labeled downstream tasks, where results can be improved or the number of labeled samples can be reduced. We argue that while this approach is quite promising it is not designed for localized downstream tasks. 
For example, ConVIRT [51] only works on per-sample image representations and does not explicitly provide more localized representations that might be beneficial for localized tasks like semantic segmentation and object detection. In this work, we therefore propose Localized representation learning from Vision and Text (LoVT), a pre-training method that utilizes the structure of radiological reports (where each sentence typically describes a single property of the image) to pre-train image models for localized tasks. It extends ConVIRT [51] and outperforms it on most localized downstream tasks.
Our contributions are as follows:
- We split each report into sentences and each image into regions (i.e. patches), jointly encode all sentences of the report to get per-sentence representations, and jointly encode all patches to get region representations.
- We align sentence and region representations using an attention mechanism and local contrastive learning.
- We show that this can be effectively achieved using our novel local contrastive loss that encourages spatial smoothness and sensitivity.
- We evaluate our method trained using MIMIC-CXR [13, 22,23,24] on a downstream evaluation framework [30] with 18 localized tasks on chest X-rays, including object detection and semantic segmentation on five public datasets. We compare it with several self- and text-supervised methods and with transfer from classification in more than 1400 evaluation runs. Our method LoVT proves to be the most successful method, outperforming all other methods on 10 out of 18 tasks.
2 Related Work
In recent years, contrastive learning [2,3,4, 6, 7, 11, 14, 15, 17, 18, 25, 29, 31, 47, 50] has become the state-of-the-art approach for self-supervised representation learning on images. It has been successfully applied as a pre-training method in medical imaging, including for downstream tasks such as image classification on chest X-rays [12, 41, 42].
Unlike our method, most contrastive learning approaches use only instance-level contrast, i.e. they represent each view of the image by a single vector. While the resulting representations are well-suited for global downstream tasks, they are not designed for localized downstream tasks. Therefore, a number of recent approaches use region-level contrast [5, 28, 32, 46, 48, 49], i.e. they act on representations of image regions. Unlike our method, these methods do not utilize paired text.
Recently however, there is much focus on self-supervised representation learning methods that pre-train image models for downstream tasks by taking advantage of the companion text [8, 21, 27, 33, 38, 51]. VirTex [8] and ICMLM [38] use image captioning tasks (generative tasks). ConVIRT [51], CLIP [33] and ALIGN [21] on the other hand use multiview contrastive learning [1]. These approaches have been found to be more effective for discriminative downstream tasks [33]. ConVIRT, CLIP, and ALIGN all follow the same general framework where an image and a text encoder are trained jointly using the NT-Xent loss (which is also used in SimCLR) on image and text views. The text views are based on single sentences from companion text, in the case of ConVIRT it is a sentence sampled from the radiology report. The main difference between these methods is the datasets they are studied on, ConVIRT is trained on chest X-rays while the other methods use natural images. Additionally, CLIP uses attention pooling to compute image representations from feature maps while the other methods use the default pooling method from the image encoder (average pooling in the case of ResNet50 [16]). Our method follows a similar framework but adds local contrastive losses for better performance on localized tasks. Also, it encodes the whole report instead of sampling a single sentence and uses attention pooling in the image and text encoders. LocTex [27] does localized pre-training on natural images with companion text and predicts alignment of text and image regions. Unlike our method, it uses supervision generated by mouse gazes instead of learning the alignment implicitly using a local contrastive loss. 
Most related to our work is the recently published local Mutual Information approach [26], which performs contrastive learning on report sentences and image regions but targets classification instead of localized tasks and therefore encourages neither contrast between regions nor spatial smoothness.
3 Method
3.1 Assumptions and Intuition
As shown in Fig. 1, a radiology report is typically split into several sections, including a Findings section, describing related radiological images, and an Assessment section, interpreting the findings. As these sections describe medical aspects observed (Findings) in one or more related images and conclusions (Assessment) drawn from it, they provide supervision for identifying relevant patterns in the images and interpretations of these patterns. Both sections can be split into sentences and each of these sentences typically describes one or a few aspects of which we assume that most are related to one or a few very localized regions in a paired image. We randomly sample one of the images related to a given report and split it into \(7 \times 7\) equally-sized regions. More precisely, we augment and resize the image to a size of \(224 \times 224\), feed it into a convolutional neural network, and use the output feature map of size \(7 \times 7\) as region representations. A language model encodes the tokens of the report as contextualized (i.e. considering their meaning in the whole report) vector representations from which we compute sentence representations. A many-to-many alignment model is then used to compute cross-modal representations from uni-modal representations, i.e. image region representations from sentence representations and vice-versa. We argue that by aligning cross-modal and uni-modal representations, the image region representations are encouraged to contain the high-level semantics present in the report.
3.2 Model Overview
Figure 2 shows the general architecture of our proposed LoVT model. Each training sample \(\boldsymbol{x}_i\) is a pair of an image \(\boldsymbol{x}^{\mathcal {I}}_i\in \mathbb {R}^{224 \times 224}\) and the related report \(\boldsymbol{x}^{\mathcal {R}}_i\) consisting of \(M_i\) sentences. Both, \(\boldsymbol{x}^{\mathcal {I}}_i\) and \(\boldsymbol{x}^{\mathcal {R}}_i\), are encoded independently into two global representations, for image and report respectively, and multiple local representations per sample, corresponding to image regions and report sentences, respectively. An attention-based alignment model then computes cross-modal representations (i.e. sentence representations from image regions and vice-versa) which are aligned with the local uni-modal representations using local contrastive losses. Additionally, the global representations are aligned using a global contrastive loss. The encoders and the alignment model are trained jointly on batches of image-report pairs \(\boldsymbol{x}_i\). The details of the model and the loss function will be described in the following sections.
3.3 Encoding
Each image \(\boldsymbol{x}^{\mathcal {I}}_i\) is encoded into \(K = H \times W\) (we use \(K = 7 \times 7\)) region representations \(\boldsymbol{y}^{\mathcal {I}}_{i,k}\in \mathbb {R}^{d^{\mathcal {I}}}\) using the image encoder \(E^{\mathcal {I}}\), where k is the index of the image region, and \(d^{\mathcal {I}}\) is the dimension of the image region representation space. Our approach is encoder agnostic, i.e. any model encoding image regions into vector representations can be used for \(E^{\mathcal {I}}\). We use a ResNet50 [16] and take the feature map before global average pooling as region representations. Similarly, each report \(\boldsymbol{x}^{\mathcal {R}}_i\) is encoded into \(M_i\) sentence representations \(\boldsymbol{y}^{\mathcal {R}}_{i,m}\in \mathbb {R}^{d^{\mathcal {R}}}\) using the report encoder \(E^{\mathcal {R}}\). Here \(M_i\) is the number of sentences of report sample i, m is the index of the sentence, and \(d^{\mathcal {R}}\) is the dimension of the report sentence representation space. Note that while K is constant, \(M_i\) may be different for each sample. Any model encoding sentences into vector representations can be used for \(E^{\mathcal {R}}\). We use BERT_base [10] to jointly encode the tokens of the concatenated sentences of each report and then perform max pooling over the token representations of each sentence to get sentence representations.
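The sentence pooling step described above can be illustrated with a small Python sketch. It assumes that token representations are already available as plain lists and that each token carries the index of the sentence it belongs to; the function name `max_pool_sentences` and the toy values are hypothetical and not taken from the LoVT implementation.

```python
def max_pool_sentences(token_vectors, sentence_ids):
    """Element-wise max over the token vectors belonging to each sentence."""
    pooled = {}
    for vec, sid in zip(token_vectors, sentence_ids):
        if sid not in pooled:
            pooled[sid] = list(vec)
        else:
            pooled[sid] = [max(a, b) for a, b in zip(pooled[sid], vec)]
    # One pooled vector per sentence, ordered by sentence index
    return [pooled[sid] for sid in sorted(pooled)]


# Toy report: five contextualized token vectors (dimension 2) in two sentences
tokens = [[0.1, 0.9], [0.4, 0.2], [-0.3, 0.5], [0.7, -0.1], [0.0, 0.6]]
sentence_ids = [0, 0, 1, 1, 1]
sentence_reps = max_pool_sentences(tokens, sentence_ids)
```

Because the number of sentences differs per report, the result is a variable-length list, which matches the remark that \(M_i\) is not constant across samples.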
The global (i.e. per-sample) representations \(\bar{\boldsymbol{y}}^{\mathcal {I}}_{i}\) and \(\bar{\boldsymbol{y}}^{\mathcal {R}}_{i}\) are each computed by an attention pooling layer (not shared between modalities) on the region and sentence representations, respectively. It is implemented using multi-head query-key-value attention [44] where the query is computed from the globally averaged region or sentence representations. This pooling approach was first proposed for the image encoder of CLIP [33].
Following previous works [6, 14, 51], we compute projected local representations \(\boldsymbol{z}^{\mathcal {I}}_{i,k}\in \mathbb {R}^{d^{\mathcal {Z}}}\) and \(\boldsymbol{z}^{\mathcal {R}}_{i,m}\in \mathbb {R}^{d^{\mathcal {Z}}}\), and projected global representations \(\bar{\boldsymbol{z}}^{\mathcal {I}}_{i}\in \mathbb {R}^{\bar{d}^{\mathcal {Z}}}\) and \(\bar{\boldsymbol{z}}^{\mathcal {R}}_{i}\in \mathbb {R}^{\bar{d}^{\mathcal {Z}}}\) from the representations \(\boldsymbol{y}^{\mathcal {I}}_{i,k}\), \(\boldsymbol{y}^{\mathcal {R}}_{i,m}\), \(\bar{\boldsymbol{y}}^{\mathcal {I}}_{i}\), and \(\bar{\boldsymbol{y}}^{\mathcal {R}}_{i}\), using the (non-shared) nonlinear transformations \(f^{\mathcal {I}}\), \(f^{\mathcal {R}}\), \(\bar{f}^{\mathcal {I}}\), and \(\bar{f}^{\mathcal {R}}\), respectively, where \(d^{\mathcal {Z}}\) is the dimension of the shared local and \(\bar{d}^{\mathcal {Z}}\) of the shared global representation space (we use 512 for both). Note that for local representations the projections are applied to each region k or sentence m independently.
3.4 Alignment Model
Following our assumptions (see Sect. 3.1), we compute an alignment of image regions and sentences and compute cross-modal representations using the alignment models \(A^{\mathcal {I}\rightarrow \mathcal {R}}\) and \(A^{\mathcal {R}\rightarrow \mathcal {I}}\), which are based on single-head query-key-value attention [44].
For each sentence m the cross-modal representation \(\boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}\) is computed by letting \(\boldsymbol{z}^{\mathcal {R}}_{i,m}\) attend to all image region representations \(\boldsymbol{z}^{\mathcal {I}}_{i,k}\) (of the related image). We therefore compute the probability \(\alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k}\) that sentence m is aligned with region k based on the scaled dot product scores of their projected representations, i.e. \(\alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k}= \text {softmax}_{k}\left( \frac{(\boldsymbol{Q}\boldsymbol{z}^{\mathcal {R}}_{i,m})^T(\boldsymbol{Q}\boldsymbol{z}^{\mathcal {I}}_{i,k})}{\sqrt{d^{\mathcal {Z}}}}\right) \), where the linear query-key projection \(\boldsymbol{Q}\) is a learned matrix. Then the alignment model \(A^{\mathcal {I}\rightarrow \mathcal {R}}\) uses \(\alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k}\) to compute \(\boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}\) as the projected weighted sum of the image region representations \(\boldsymbol{z}^{\mathcal {I}}_{i,k}\):
\[\boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m} = \boldsymbol{O} \sum _{k=1}^{K} \alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k} \left( \boldsymbol{V}\boldsymbol{z}^{\mathcal {I}}_{i,k}\right) \]
where the value projection \(\boldsymbol{V}\) and the output projection \(\boldsymbol{O}\) are learned matrices.
In a similar fashion the cross-modal representations \(\boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k}\) are computed by \(A^{\mathcal {R}\rightarrow \mathcal {I}}\):
\[\boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k} = \boldsymbol{O} \sum _{m=1}^{M_i} \alpha ^{\mathcal {R}\rightarrow \mathcal {I}}_{i, k, m} \left( \boldsymbol{V}\boldsymbol{z}^{\mathcal {R}}_{i,m}\right) \]
with \(\alpha ^{\mathcal {R}\rightarrow \mathcal {I}}_{i, k, m}= \text {softmax}_{m}\left( \frac{(\boldsymbol{Q}\boldsymbol{z}^{\mathcal {I}}_{i,k})^T(\boldsymbol{Q}\boldsymbol{z}^{\mathcal {R}}_{i,m})}{\sqrt{d^{\mathcal {Z}}}}\right) \). Note that as \(A^{\mathcal {R}\rightarrow \mathcal {I}}\) and \(A^{\mathcal {I}\rightarrow \mathcal {R}}\) share the same matrices \(\boldsymbol{Q}\), \(\boldsymbol{V}\), and \(\boldsymbol{O}\), the only difference between \(\alpha ^{\mathcal {R}\rightarrow \mathcal {I}}_{i, k, m}\) and \(\alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k}\) is transposition and the index over which softmax is applied.
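To make the shared-parameter attention concrete, the following Python sketch computes both cross-modal directions for a single sample from one score matrix. It is a minimal illustration under stated assumptions: plain nested lists instead of tensors, hypothetical helper names (`matvec`, `weighted_sum`, `align`), and identity matrices standing in for the learned \(\boldsymbol{Q}\), \(\boldsymbol{V}\), and \(\boldsymbol{O}\).

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def align(z_regions, z_sentences, Q, V, O):
    """Cross-modal representations and attention probabilities for one sample."""
    scale = math.sqrt(len(z_regions[0]))  # sqrt(d^Z)
    q_reg = [matvec(Q, z) for z in z_regions]
    q_sen = [matvec(Q, z) for z in z_sentences]
    # scores[m][k]: scaled dot product between sentence m and region k
    scores = [[sum(a * b for a, b in zip(qs, qr)) / scale for qr in q_reg]
              for qs in q_sen]
    alpha_i2r = [softmax(row) for row in scores]              # softmax over regions k
    alpha_r2i = [softmax(list(col)) for col in zip(*scores)]  # softmax over sentences m
    v_reg = [matvec(V, z) for z in z_regions]
    v_sen = [matvec(V, z) for z in z_sentences]
    z_i2r = [matvec(O, weighted_sum(a, v_reg)) for a in alpha_i2r]  # one per sentence
    z_r2i = [matvec(O, weighted_sum(a, v_sen)) for a in alpha_r2i]  # one per region
    return z_i2r, z_r2i, alpha_i2r, alpha_r2i


identity = [[1.0, 0.0], [0.0, 1.0]]              # stand-in for learned matrices
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # K = 3 toy regions
sentences = [[2.0, 0.0], [0.0, 2.0]]             # M = 2 toy sentences
z_i2r, z_r2i, alpha_i2r, alpha_r2i = align(regions, sentences, identity, identity, identity)
```

The score matrix is computed once and only transposed for the second direction, which mirrors the observation that the two attention maps differ only in transposition and in the softmax axis.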
3.5 Loss Function
Global Alignment. For global alignment we follow ConVIRT [51] and maximize the cosine similarity between paired image and report representations while minimizing the similarity between non-paired (i.e. from different samples) representations. The loss consists of an image-report part, where all non-paired report representations from the batch are used as negatives:
and a report-image part, defined analogously:
where \(\tau \) is the similarity temperature (we use 0.1) and all logarithms are natural. Both parts are combined using the hyperparameter \(\lambda \in [0, 1]\) (we use 0.75):
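As a rough illustration of the global loss described above, the following Python sketch computes a two-part, temperature-scaled contrastive loss over a batch. Several details are assumptions rather than statements from the text: the denominator includes the positive pair (as in NT-Xent), the weight \(\lambda \) is applied to the image-report part and \(1 - \lambda \) to the report-image part (the ConVIRT convention), and the result is averaged over the batch. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(values))) with natural logarithm."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def global_loss(z_img, z_rep, tau=0.1, lam=0.75):
    """Batch-averaged combination of image-report and report-image parts."""
    n = len(z_img)
    sim = [[cosine(z_img[i], z_rep[j]) / tau for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        # Image i against all reports of the batch (assumed to include the positive)
        image_to_report = log_sum_exp(sim[i]) - sim[i][i]
        # Report i against all images of the batch
        report_to_image = log_sum_exp([sim[j][i] for j in range(n)]) - sim[i][i]
        total += lam * image_to_report + (1.0 - lam) * report_to_image
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
matched = global_loss(images, [[1.0, 0.0], [0.0, 1.0]])     # pairs agree
mismatched = global_loss(images, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
```

With the toy inputs, correctly paired representations yield a loss close to zero, while swapped pairs are penalized heavily because of the low temperature.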
Local Alignment. The global alignment loss not only aligns the global representations but also prevents them from collapsing to a constant vector by using negative samples to contrast the positive pairs. Similarly, we propose local alignment losses that encourage spatial (sentence) sensitivity through negatives from the same sample, i.e. they prevent the local representations from being similar for all regions (sentences) of an image (report). We use two NT-Xent-based [6] local losses: \(\mathcal {L}_\text {local-image}\), aligning region representations \(\boldsymbol{z}^{\mathcal {I}}_{i,k}\) with \(\boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k}\), and \(\mathcal {L}_\text {local-report}\), aligning sentence representations \(\boldsymbol{z}^{\mathcal {R}}_{i,m}\) with \(\boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}\).
Some regions or sentences may not be relevant for aligning a sample (e.g. background regions or sentences not related to the image). Therefore, we introduce region weights \(w^{\mathcal {I}}_{i, k}\) and sentence weights \(w^{\mathcal {R}}_{i, m}\), which are computed as the attention probabilities from the respective attention pooling layer (which was used to compute global representations), averaged over all attention heads. These weights are used in the local loss functions such that irrelevant representations do not have to be aligned. Note that we do not backpropagate through the region or sentence weights.
The loss \(\mathcal {L}_\text {local-image}\) allows for having multiple positive pairs within each sample by giving each pair of regions (k, l) a positiveness probability \(p^{\mathcal {I}}_{k, l}\in [0, 1]\). We then treat each positive pair as its own (weighted) example and contrast it with all other pairs (again all logarithms are natural):
Here \(\tau '\) is the similarity temperature and is set to 0.3. We assume that nearby image regions are often similar and that therefore nearby regions are more likely to be positives while distant regions are more likely to be negatives. Thus, we define the positiveness probability \(p^{\mathcal {I}}_{k, l}\) of two image regions as the complementary cumulative exponential distribution of \(d_{\boldsymbol{x}}\) (their spatial \(\ell _2\)-distance in 2D space normalized by the length of the diagonal \(\sqrt{H^2 + W^2}\)) and set \(p^{\mathcal {I}}_{k, l}\) to zero above cutoff threshold \(T \in [0, \infty )\):
Here \(\beta \in (0, \infty )\) is a sharpness hyperparameter. We set \(\beta =1\) and \(T = 0.5\). Note that the normalization of \(d_{\boldsymbol{x}}\) is equal to rescaling T and \(\beta \), i.e. it allows us to define both hyperparameters independently of the image size.
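The positiveness probability can be sketched in a few lines of Python. The sketch assumes row-major region indices on the \(H \times W\) grid, distances measured between grid positions, and the parametrization \(e^{-d_{\boldsymbol{x}}/\beta }\) for the complementary cumulative exponential distribution; for the setting \(\beta = 1\) used here this coincides with \(e^{-\beta d_{\boldsymbol{x}}}\), so the choice does not affect the example. The function name `positiveness` is hypothetical.

```python
import math


def positiveness(k, l, H=7, W=7, beta=1.0, T=0.5):
    """Positiveness probability of regions k and l (row-major indices, assumed)."""
    ky, kx = divmod(k, W)
    ly, lx = divmod(l, W)
    # Spatial l2-distance, normalized by the length of the grid diagonal
    d = math.hypot(ky - ly, kx - lx) / math.hypot(H, W)
    # Complementary cumulative exponential distribution, cut off above T
    return math.exp(-d / beta) if d <= T else 0.0


same_region = positiveness(0, 0)        # distance 0
neighbour = positiveness(0, 1)          # horizontally adjacent regions
two_apart = positiveness(0, 2)          # two columns apart
opposite_corners = positiveness(0, 48)  # beyond the cutoff T
```

A region is always a full positive for itself, neighbouring regions remain strong positives, and pairs whose normalized distance exceeds T contribute only as negatives.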
The definition of \(p^{\mathcal {I}}_{k, l}\) is derived by modeling the occurrence of related features at specific distances in the image as a Poisson point process, such that the \(\ell _2\)-distance of related features follows the exponential distribution. We assume a Poisson process due to its property of being memoryless, i.e. knowing that a feature is already related to another feature at some distance does not change how distant additional related features can be found. Also, the probability density function of the exponential distribution is decreasing (with support on the interval \([0, \infty )\)), which seems reasonable as it is typically more likely that related features are near than far. Its cumulative distribution function then describes the probability that two related features are within a given radius and its complementary function that of being outside a given radius. The threshold T assures that very distant pairs do not count as positives. The loss \(\mathcal {L}_\text {local-image}\) thus encourages spatial smoothness of image regions while maintaining spatial sensitivity through negative samples. Note that it is related to the pixel-contrast loss proposed in [49], where the main novelty of our work is the partly smooth definition of \(p^{\mathcal {I}}_{k, l}\) based on the exponential distribution.
The local report loss \(\mathcal {L}_\text {local-report}\) is defined similarly but we do not assume prior knowledge about the similarity of sentences and therefore only have a single positive pair per sentence (again all logarithms are natural):
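A possible shape of such a weighted, single-positive local loss for one report is sketched below in Python. This is not the exact LoVT formulation: the symmetric averaging of both contrast directions, the reuse of the temperature 0.3 from the image loss, and the function names are assumptions made for illustration. Negatives come only from the other sentences of the same report, and the sentence weights are treated as constants.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(values))) with natural logarithm."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def local_report_loss(z_sent, z_cross, weights, tau=0.3):
    """Weighted contrastive loss between sentence and cross-modal representations."""
    count = len(z_sent)
    sim = [[cosine(z_sent[m], z_cross[n]) / tau for n in range(count)]
           for m in range(count)]
    loss = 0.0
    for m in range(count):
        # Sentence m against all cross-modal representations of the same report
        uni_to_cross = log_sum_exp(sim[m]) - sim[m][m]
        # Cross-modal representation m against all sentences of the same report
        cross_to_uni = log_sum_exp([sim[n][m] for n in range(count)]) - sim[m][m]
        loss += weights[m] * 0.5 * (uni_to_cross + cross_to_uni)
    return loss


sentences = [[1.0, 0.0], [0.0, 1.0]]
aligned = local_report_loss(sentences, [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
misaligned = local_report_loss(sentences, [[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5])
ignored = local_report_loss(sentences, [[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0])
```

Sentences with zero weight drop out of the loss entirely, which is the intended effect of the weighting for sentences that are not related to the image.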
Total Loss. The total loss \(\mathcal {L}\) is computed as the weighted sum of global and local losses:
\[\mathcal {L} = \gamma \cdot \mathcal {L}_\text {global} + \mu \cdot \mathcal {L}_\text {local-image} + \nu \cdot \mathcal {L}_\text {local-report}\]
where \(\gamma \), \(\mu \), and \(\nu \) are loss weights to balance the individual losses and are set to 1.0, 0.75, and 0.75, respectively. We determined these loss weights by running small grid searches (see supplementary material for details).
4 Evaluation
4.1 Downstream Tasks and Experimental Setup
We evaluate our method on a downstream evaluation framework [30] with 18 localized tasks on chest X-rays, which we briefly describe here. For more details, we refer to the supplementary material.
Evaluation Protocols. We only evaluate the pre-trained ResNet50 (from the image encoder). For semantic segmentation tasks we use the following evaluation protocols: (i) U-Net Finetune, where the ResNet50 is used as the backbone of a U-Net [35] and is finetuned jointly with all other layers, (ii) U-Net Frozen, where the ResNet50 is used as the frozen backbone of a U-Net [35] and only the non-backbone layers are trained, and (iii) Linear, where an element-wise linear layer is trained that is applied to the feature map of the frozen ResNet50, and then results are upsampled to the segmentation resolution.
For object detection tasks we use the following protocols: (i) YOLOv3 Finetune, where the ResNet50 is used as the backbone of a YOLOv3 [34] model and is finetuned jointly with all other layers, (ii) YOLOv3 Frozen, where the ResNet50 is used as the frozen backbone of a YOLOv3 [34] model and only the non-backbone layers are trained, and (iii) Linear, where the object detection ground truth is converted to segmentation masks and the Linear evaluation protocol is applied.
Downstream Datasets. We evaluate the pre-trained ResNet50 on several medical datasets, namely (i) RSNA Pneumonia Detection [39, 45], with more than 26000 frontal-view chest X-rays with detection targets for pneumonia opacities. We use the YOLOv3 Finetune, YOLOv3 Frozen, and Linear protocols, each with \(1 \%\), \(10 \%\), and \(100 \%\) of the training samples; (ii) COVID Rural [9, 43], with more than 200 frontal-view chest X-rays with segmentation masks for COVID-19 lung opacity regions. We use the U-Net Finetune, U-Net Frozen, and Linear protocols; (iii) SIIM-ACR Pneumothorax Segmentation [40], with more than 12000 frontal-view chest X-rays with segmentation masks for pneumothorax. We use the U-Net Finetune and U-Net Frozen protocols but do not use Linear due to the fine-grained nature of the segmentation masks; (iv) Object CXR [20], with 9000 frontal-view chest X-rays with detection targets for foreign objects. We use the YOLOv3 Finetune, YOLOv3 Frozen, and Linear protocols; (v) NIH CXR [45], with almost 1000 frontal-view chest X-rays with detection targets for eight pathologies (Atelectasis, Cardiomegaly, Effusion, Infiltrate, Mass, Nodule, Pneumonia, and Pneumothorax). Due to the limited data per class, we only use the Linear protocol. The different evaluation protocols are complementary to each other: the U-Net Finetune and YOLOv3 Finetune protocols evaluate how well suited the pre-trained image models are for fine-tuning as used in practical applications, and the Linear protocols evaluate the quality of learned local representations (i.e. feature maps) while adding only a few parameters and therefore mostly omitting the variance introduced by random initialization during downstream evaluation. The U-Net Frozen and YOLOv3 Frozen protocols are a trade-off, where representations are frozen but evaluated in a more practical setting (with several randomly initialized layers).
Tuning and Evaluation Procedure. All baselines and our models have been tuned only on a single downstream task, RSNA YOLOv3 Frozen 10%, where a single fixed downstream learning rate was used (determined in preliminary experiments) and the results of five runs have been averaged. Other downstream tasks have not been evaluated during tuning to make sure that models are not biased towards the downstream tasks. After tuning, we evaluated each model on all downstream tasks: The learning rates were tuned individually per model and task (using single evaluation runs) before running five evaluations per task (all using the tuned learning rate). We report the average results of these five runs and their \(95\%\)-confidence interval (where each evaluation run is considered a sample).
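The aggregation over five runs can be illustrated as follows. The text does not state how the \(95\%\)-confidence interval is computed; this Python sketch assumes a two-sided Student-t interval around the mean with four degrees of freedom (critical value 2.776). The scores are hypothetical and do not come from the reported tables.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five runs)
T_CRITICAL_95_DF4 = 2.776


def mean_and_ci(scores, t_critical=T_CRITICAL_95_DF4):
    """Mean of the runs and half-width of the assumed t-based confidence interval."""
    mean = statistics.mean(scores)
    half_width = t_critical * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


runs = [10.0, 12.0, 11.0, 13.0, 9.0]  # hypothetical scores of five evaluation runs
mean, half_width = mean_and_ci(runs)
```

Under this assumption, one method would be counted as better than another by at least the confidence interval only if the intervals around the two means do not overlap.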
Pre-Training Dataset. We train our method on MIMIC-CXR [13, 22,23,24] (version 2) as, to the best of our knowledge, it is the largest and most commonly used dataset of this kind. Since all downstream tasks contain only frontal views, we remove all lateral views, such that roughly 210000 training samples remain, each with a report and one or more frontal images.
Baselines. We compare our method against several baseline methods:
- Random Init.: The ResNet50 is initialized using its default random initialization;
- ImageNet [36] Init.: The ResNet50 is initialized with weights pre-trained on the ImageNet ILSVRC-2012 task [36];
- CheXpert [19]: The ResNet50 is pre-trained using supervised multi-label binary classification with CheXpert [19] labels on frontal chest X-rays of MIMIC-CXR;
- Global image pre-training methods: The ResNet50 is pre-trained using the self-supervised pre-training methods SimCLR [6] or BYOL [14] on frontal chest X-rays of MIMIC-CXR. We include SimCLR as it uses a loss function similar to that of LoVT, and we include BYOL because of its widespread use.
- Local image pre-training methods: The ResNet50 is pre-trained using the self-supervised pre-training method PixelPro [49] on frontal chest X-rays of MIMIC-CXR. We include PixelPro to study the effect of local contrastive losses when using only images.
- Global image-text pre-training methods: The ResNet50 is pre-trained using the image-text methods ConVIRT [51] or CLIP [33] on frontal chest X-rays of MIMIC-CXR. Note that for comparability we adapted CLIP to use the same image and text encoders as ConVIRT, such that the main difference between CLIP and ConVIRT is that CLIP uses attention pooling to compute the scan representation while ConVIRT uses average pooling. We include both methods as LoVT builds upon a similar general framework: ConVIRT because it targets chest X-rays (like LoVT), and CLIP because of its widespread use and because it uses (like LoVT) attention pooling in the image encoder. We decided not to include VirTex [8] and ICMLM [38] as they use generative tasks, which have been found to be less effective for discriminative downstream tasks [33].
4.2 Downstream Results
We present the downstream results of our model LoVT and the baselines, with pre-training on 100% and 30% of MIMIC-CXR. Table 1 shows the results on different subsets of the RSNA dataset and Table 2 shows the results on the remaining downstream datasets, i.e. on COVID Rural, SIIM-ACR Pneumothorax, Object CXR, and NIH CXR.
Comparison of Methods. We found that there is no single pre-training method performing best on all evaluated downstream tasks. On most tasks (15 out of 18) image-text self-supervised methods (i.e. LoVT, CLIP, or ConVIRT) outperform the other methods, such that they should be preferred if paired text is available.
Our model LoVT is the best method (over all pre-training settings) on 10 of 18 tasks, and significantly outperforms all other methods in 6 of these tasks, while the second-best method CLIP significantly outperforms all other methods only on 2 tasks. LoVT outperforms image-only methods (i.e. BYOL, SimCLR, and PixelPro) on 14 tasks, where the localized image-only method PixelPro outperforms LoVT only on one task (task 15). On 11 tasks LoVT outperforms other text-supervised methods (i.e. ConVIRT and CLIP), on 14 tasks it outperforms CheXpert classification and on all but two tasks it outperforms ImageNet initialization. When using 100% of the pre-training data LoVT is the best pre-training method on 11 tasks (better by at least the confidence interval on 5 tasks) and when using 30% on 11 tasks (significantly the best on 4 tasks). LoVT performs best on all COVID Rural tasks, best on most Linear tasks, and quite well on the Frozen protocol, but does not perform well on the NIH CXR dataset and when finetuned on the RSNA dataset. As there is no single method performing best on all tasks and LoVT performs best in the majority of tasks, this makes LoVT the default method of choice for localized downstream tasks.
Relevance of Pre-Training Dataset Size. We do not observe a consistent benefit of using roughly 210000 pre-training samples (i.e. 100% of the data) over using roughly 63000 samples (i.e. 30%). While on some datasets like RSNA and Object CXR many methods often perform better when pre-trained on 210000 samples (100%), on other datasets like COVID Rural, methods often perform better when pre-trained on 63000 samples (30%). When comparing LoVT pre-trained on 30% of the data with other methods pre-trained in both settings (i.e. 30% and 100%), we observe that LoVT outperforms image-only methods (i.e. BYOL, SimCLR, and PixelPro) on 12 tasks, other text-supervised methods (i.e. ConVIRT and CLIP) on 7 tasks and CheXpert classification on 12 tasks, showing that LoVT effectively reduces the number of required pre-training samples.
Relevance of Downstream Dataset Size. The results shown in Table 1 suggest that, as expected, larger downstream training sets lead to better results. However, we observe that for text-supervised methods (i.e. LoVT, CLIP, and ConVIRT), the downstream training set size is often less relevant than for other methods. On the RSNA YOLOv3 Frozen tasks, LoVT (100%) outperforms ImageNet initialization by 31% when using 100% of the downstream samples, while it outperforms ImageNet initialization by 167% when using only 1% of the samples.
Spatial Smoothness and Sensitivity. We analyze the influence of the local losses and attention pooling on the spatial smoothness and sensitivity of image region representations. To this end, Fig. 3 plots the distributions of the cosine similarity of image region pairs over their spatial distances. For our LoVT model, spatial smoothness and sensitivity can be observed as the quartiles and extreme points of the cosine similarity distributions decrease monotonically with increasing spatial distance, except for a few very distant region pairs with distances larger than 0.6. Note that these spatially very distant region pairs very likely represent opposite borders (or corners) of the image, such that both very likely contain background, which explains why they have more similar representations. Without the local losses \(\mathcal {L}_\text {local-image}\) and \(\mathcal {L}_\text {local-report}\), the quartiles and extreme points decrease only for small spatial distances and increase again for points further away, showing that spatial smoothness is only present for nearby regions and that the spatial sensitivity of more distant regions is not optimal. When additionally replacing attention pooling with average (for image regions) and max (for sentences) pooling, similar results can be observed, except that the quartiles decrease faster and the maximum points do not decrease for nearby regions. We can therefore deduce that the local losses effectively encourage spatial smoothness and sensitivity while attention pooling alone has only little effect.
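An analysis of this kind can be sketched in Python as follows. The sketch groups the cosine similarities of all region pairs of one feature map by their normalized spatial distance; the equal-width binning, the function names, and the toy feature map (whose vectors rotate slowly along the horizontal axis) are illustrative assumptions and do not reproduce the data behind Fig. 3.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_by_distance(feature_map, num_bins=5):
    """Group pairwise cosine similarities by normalized spatial distance."""
    H, W = len(feature_map), len(feature_map[0])
    diagonal = math.hypot(H, W)
    cells = [(y, x) for y in range(H) for x in range(W)]
    bins = defaultdict(list)
    for a in range(len(cells)):
        for b in range(a + 1, len(cells)):
            (y1, x1), (y2, x2) = cells[a], cells[b]
            d = math.hypot(y1 - y2, x1 - x2) / diagonal
            index = min(int(d * num_bins), num_bins - 1)
            bins[index].append(cosine(feature_map[y1][x1], feature_map[y2][x2]))
    return {index: sorted(values) for index, values in sorted(bins.items())}


# Toy 7 x 7 feature map: the 2D vector rotates by 0.1 rad per column
toy_map = [[[math.cos(0.1 * x), math.sin(0.1 * x)] for x in range(7)] for y in range(7)]
grouped = similarity_by_distance(toy_map)
nearest, farthest = grouped[0], grouped[4]
```

Quartiles and extreme points per bin can then be read from the sorted lists; a spatially smooth and sensitive representation shows values that fall as the bin index grows.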
Analysis of LoVT and Ablation Study. We refer to the supplementary material for a detailed analysis of our method LoVT, including an ablation study (focusing on local weighting, global and local losses, and attention pooling), an analysis of the distribution and alignment of learned representations, and an analysis of the region weights \(w^{\mathcal {I}}_{i, k}\).
5 Discussion
Limitations of Our Evaluation Procedure. In the evaluation procedure, we did not apply extensive hyperparameter tuning, resized all inputs to a resolution of only \(224 \times 224\), and applied no data augmentation. The presented downstream results are therefore below the results typically reported on these datasets. We followed [30] and kept the evaluation procedure simple to limit the required computational resources and to avoid bias induced by tuning, which allows for a fair comparison of our method with the baselines.
Limitations of LoVT. LoVT learns its alignment model implicitly based only on latent representations and instance-level pairing information. This makes the model sensitive to hyperparameters and hard to train. Also, it only uses local negatives from the same sample, which restricts the number of negatives and may therefore limit its performance. Additionally, the alignment model is restricted to a simple attention mechanism, and the regions are based on fixed patches that do not adapt to the contents of the image. This may restrict the capabilities of the model and therefore of the pre-training method. For a detailed discussion of these limitations as well as of the potential negative societal impact, we refer to the supplementary material.
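To illustrate why negatives drawn from the same sample are limited in number, the sketch below implements a generic one-directional InfoNCE loss over the local representations of a single sample. It is not the exact LoVT loss (for example, it omits region weights and the second direction), and the function names and temperature value are assumptions made for this example. With \(K\) local representations, each positive pair is contrasted against only \(K - 1\) negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def local_info_nce(xs, ys, temperature=0.1):
    """Generic InfoNCE over K paired local representations of ONE sample.

    xs[i] and ys[i] form the positive pair; the other K - 1 entries of ys
    (all from the same sample) serve as negatives for xs[i].
    """
    losses = []
    for i, x in enumerate(xs):
        logits = [cosine(x, y) / temperature for y in ys]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy input: three orthogonal "region" vectors that match their counterparts.
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(local_info_nce(xs, ys))
```

If all entries of `ys` were identical, the positives could not be distinguished from the negatives and the loss would equal \(\log K\), the value of a uniform guess among \(K\) candidates.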
Conclusion. We study pre-training for localized medical imaging on chest X-rays and propose a novel text-supervised method called LoVT that combines instance-level contrastive learning with local contrastive learning. We evaluate our method on 18 localized tasks on chest X-rays and compare it with typically used pre-training and initialization methods. While there is no single best method for all tasks, LoVT is the best method on 10 out of the 18 studied tasks, making it the method of choice for localized tasks.
We hope that our work provides valuable insights that encourage using pre-training for localized medical imaging and that our method inspires future work on localized text-supervised pre-training.
References
Bachman, P., Hjelm, R., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: NeurIPS (2019)
Bardes, A., Ponce, J., LeCun, Y.: VICReg: variance-invariance-covariance regularization for self-supervised learning. In: ICLR (2022)
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: NeurIPS (2020)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: ICCV, pp. 9630–9640 (2021). https://doi.org/10.1109/ICCV48922.2021.00951
Chaitanya, K., Erdil, E., Karani, N., Konukoglu, E.: Contrastive learning of global and local features for medical image segmentation with limited annotations. In: NeurIPS (2020)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML (2020)
Chen, X., He, K.: Exploring simple Siamese representation learning. In: CVPR, pp. 15745–15753 (2021). https://doi.org/10.1109/CVPR46437.2021.01549
Desai, K., Johnson, J.: VirTex: learning visual representations from textual annotations. In: CVPR, pp. 11157–11168 (2021). https://doi.org/10.1109/CVPR46437.2021.01101
Desai, S., et al.: Data from chest imaging with clinical and genomic correlates representing a rural COVID-19 positive population [data set]. The Cancer Imaging Archive (2020). https://doi.org/10.7937/tcia.2020.py71-5978
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019). https://doi.org/10.18653/v1/N19-1423
Ermolov, A., Siarohin, A., Sangineto, E., Sebe, N.: Whitening for self-supervised representation learning. In: ICML, pp. 3015–3024 (2021)
Gazda, M., Plavka, J., Gazda, J., Drotár, P.: Self-supervised deep convolutional neural network for chest x-ray classification. IEEE Access, 151972–151982 (2021). https://doi.org/10.1109/ACCESS.2021.3125324
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., et al.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation [Online] 101(23), 215–220 (2000)
Grill, J.B., et al.: Bootstrap your own latent - a new approach to self-supervised learning. In: NeurIPS (2020)
He, K., Fan, H., Wu, Y., et al.: Momentum contrast for unsupervised visual representation learning. In: CVPR, pp. 9726–9735 (2020). https://doi.org/10.1109/CVPR42600.2020.00975
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Hjelm, R.D., et al.: Learning deep representations by mutual information estimation and maximization. In: ICLR (2019)
Hénaff, O.J., Srinivas, A., et al.: Data-efficient image recognition with contrastive predictive coding. In: ICML, pp. 4182–4192 (2019)
Irvin, J., et al.: CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In: AAAI, pp. 590–597 (2019)
JF-Healthcare: object-CXR - automatic detection of foreign objects on chest x-rays. MIDL (2020). https://jfhealthcare.github.io/object-CXR/
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML (2021)
Johnson, A., Lungren, M., Peng, Y., et al.: MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0). PhysioNet (2019). https://doi.org/10.13026/8360-t248
Johnson, A., Pollard, T., Berkowitz, S., et al.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6(317) (2019). https://doi.org/10.1038/s41597-019-0322-0
Johnson, A., Pollard, T., Mark, R., Berkowitz, S., Horng, S.: MIMIC-CXR database (version 2.0.0). PhysioNet (2019). https://doi.org/10.13026/C2JT1Q
Li, J., Zhou, P., Xiong, C., Hoi, S.C.H.: Prototypical contrastive learning of unsupervised representations. In: ICLR (2021)
Liao, R., et al.: Multimodal representation learning via maximization of local mutual information. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol. 12902, pp. 273–283. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87196-3_26
Liu, Z., Stent, S., Li, J., Gideon, J., Han, S.: LocTex: learning data-efficient visual representations from localized textual supervision. In: ICCV, pp. 2147–2156 (2021). https://doi.org/10.1109/ICCV48922.2021.00217
Mahendran, A., Thewlis, J., Vedaldi, A.: Cross pixel optical-flow similarity for self-supervised learning. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11365, pp. 99–116. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20873-8_7
Misra, I., van der Maaten, L.: Self-supervised learning of pretext-invariant representations. In: CVPR, pp. 6706–6716 (2020). https://doi.org/10.1109/CVPR42600.2020.00674
Müller, P., Kaissis, G., Zou, C., Rueckert, D.: Radiological reports improve pre-training for localized imaging tasks on chest x-rays. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) MICCAI 2022. LNCS, vol. 13435, pp. 647–657. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-16443-9_62
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2019)
Pinheiro, P.O., Almahairi, A., Benmalek, R.Y., Golemo, F., Courville, A.: Unsupervised learning of dense visual representations. In: NeurIPS (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Saraf, V., Chavan, P., Jadhav, A.: Deep learning challenges in medical imaging. In: Vasudevan, H., Michalas, A., Shekokar, N., Narvekar, M. (eds.) Advanced Computing Technologies and Applications. AIS, pp. 293–301. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-3242-9_28
Sariyildiz, M.B., Perez, J., Larlus, D.: Learning visual representations with caption annotations. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12353, pp. 153–170. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3_10
Shih, G., et al.: Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiol. Artif. Intell. 1 (2019). https://doi.org/10.1148/ryai.2019180041
Society for Imaging Informatics in Medicine: SIIM-ACR pneumothorax segmentation (2019). https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation
Sowrirajan, H., Yang, J., Ng, A.Y., Rajpurkar, P.: MoCo pretraining improves representation and transferability of chest x-ray models. In: MIDL (2021)
Sriram, A., et al.: COVID-19 prognosis via self-supervised representation learning and multi-image prediction. arXiv preprint arXiv:2101.04909 (2021)
Tang, H., Sun, N., Li, Y.: Segmentation model of the opacity regions in the chest X-rays of the COVID-19 patients in the US rural areas and the application to the disease severity. medRxiv (2020). https://doi.org/10.1101/2020.10.19.20215483
Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
Wang, X., Peng, Y., Lu, L., et al.: ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: CVPR, pp. 3462–3471 (2017). https://doi.org/10.1109/CVPR.2017.369
Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised visual pre-training. In: CVPR, pp. 3023–3032 (2021). https://doi.org/10.1109/CVPR46437.2021.00304
Wu, Z., Xiong, Y., Yu, S., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR, pp. 3733–3742 (2018). https://doi.org/10.1109/CVPR.2018.00393
Xie, E., et al.: DetCo: unsupervised contrastive learning for object detection. In: ICCV, pp. 8372–8381 (2021). https://doi.org/10.1109/ICCV48922.2021.00828
Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., Hu, H.: Propagate yourself: exploring pixel-level consistency for unsupervised visual representation learning. In: CVPR, pp. 16679–16688 (2021). https://doi.org/10.1109/CVPR46437.2021.01641
Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: ICML (2021)
Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747 (2020)
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Müller, P., Kaissis, G., Zou, C., Rueckert, D. (2022). Joint Learning of Localized Representations from Medical Images and Reports. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13686. Springer, Cham. https://doi.org/10.1007/978-3-031-19809-0_39
DOI: https://doi.org/10.1007/978-3-031-19809-0_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19808-3
Online ISBN: 978-3-031-19809-0
eBook Packages: Computer Science, Computer Science (R0)