1 Introduction and Motivation

In medical applications of computer vision, high-quality annotated data is scarce and expensive to acquire, as it typically requires trained physicians to manually label samples [37]. The requirement for large labeled datasets can therefore become quite problematic and may limit the applications of deep learning in this field. One approach to overcome this problem is to utilize radiological reports that are paired with medical images. Such reports are produced routinely in clinical practice and are typically written by medical experts (e.g. radiologists). They thus provide a valuable source of semantic information that is available at little additional cost. Rule-based Natural Language Processing (NLP) models like CheXpert [19] extract labels from these reports, allowing the automatic creation of large datasets, but they also have significant limitations. Most importantly, such approaches are typically limited to classification tasks: they generate overall labels for reports (and therefore for the paired images), but relating these labels to specific image regions is nontrivial, so they cannot be used for localized tasks like semantic segmentation or object detection. Also, rule-based NLP models have to be created manually and cannot generalize to different classification tasks or even different report writing styles [19]. Instead of using these reports to generate classification labels, the reports can be utilized directly in the pre-training method, as was first proposed in the ConVIRT method [51]. Here, the semantic information contained in the reports is used as weak supervision to pre-train image models that are then fine-tuned on labeled downstream tasks, improving results or reducing the number of labeled samples required. We argue that while this approach is quite promising, it is not designed for localized downstream tasks. For example, ConVIRT [51] only works on per-sample image representations and does not explicitly provide more localized representations that might be beneficial for localized tasks like semantic segmentation and object detection. In this work, we therefore propose Localized representation learning from Vision and Text (LoVT), a pre-training method that utilizes the structure of radiological reports (where each sentence typically describes a single property of the image) to pre-train image models for localized tasks. It extends ConVIRT [51] and outperforms it on most localized downstream tasks.

Our contributions are as follows:

  • We split each report into sentences and each image into regions (i.e. patches), jointly encode all sentences of the report to get representations per sentence and jointly encode all patches to get region representations.

  • We align sentence and region representations using an attention mechanism and local contrastive learning.

  • We show that this can be effectively achieved using our novel local contrastive loss that encourages spatial smoothness and sensitivity.

  • We evaluate our method trained using MIMIC-CXR [13, 22, 23, 24] on a downstream evaluation framework [30] with 18 localized tasks on chest X-rays, including object detection and semantic segmentation on five public datasets. We compare it with several self- and text-supervised methods and with transfer from classification in more than 1400 evaluation runs. Our method LoVT proves to be the most successful method, outperforming all other methods on 10 out of 18 tasks.

2 Related Work

In recent years, contrastive learning [2, 3, 4, 6, 7, 11, 14, 15, 17, 18, 25, 29, 31, 47, 50] has become the state-of-the-art approach for self-supervised representation learning on images. It has been successfully applied as a pre-training method in medical imaging, including downstream tasks such as image classification on chest X-rays [12, 41, 42].

Unlike our method, most contrastive learning approaches use only instance-level contrast, i.e. they represent each view of an image by a single vector. While the resulting representations are well-suited for global downstream tasks, they are not designed for localized downstream tasks. Therefore, a number of recent approaches use region-level contrast [5, 28, 32, 46, 48, 49], i.e. they act on representations of image regions. Unlike our method, these methods do not utilize paired text.

Recently, however, much attention has been directed at self-supervised representation learning methods that pre-train image models for downstream tasks by taking advantage of companion text [8, 21, 27, 33, 38, 51]. VirTex [8] and ICMLM [38] use image captioning tasks (generative tasks). ConVIRT [51], CLIP [33], and ALIGN [21], on the other hand, use multiview contrastive learning [1]. These approaches have been found to be more effective for discriminative downstream tasks [33]. ConVIRT, CLIP, and ALIGN all follow the same general framework, where an image and a text encoder are trained jointly using the NT-Xent loss (which is also used in SimCLR [6]) on image and text views. The text views are based on single sentences from the companion text; in the case of ConVIRT, it is a sentence sampled from the radiology report. The main difference between these methods is the datasets they are studied on: ConVIRT is trained on chest X-rays while the other methods use natural images. Additionally, CLIP uses attention pooling to compute image representations from feature maps while the other methods use the default pooling method of the image encoder (average pooling in the case of ResNet50 [16]). Our method follows a similar framework but adds local contrastive losses for better performance on localized tasks. Also, it encodes the whole report instead of sampling a single sentence and uses attention pooling in the image and text encoders. LocTex [27] performs localized pre-training on natural images with companion text and predicts the alignment of text and image regions. Unlike our method, it uses supervision generated by mouse gazes instead of learning the alignment implicitly using a local contrastive loss. Most related to our work is the recently published local Mutual Information approach [26], which performs contrastive learning on report sentences and image regions but targets classification instead of localized tasks and therefore encourages neither contrast between regions nor spatial smoothness.

3 Method

3.1 Assumptions and Intuition

As shown in Fig. 1, a radiology report is typically split into several sections, including a Findings section, describing related radiological images, and an Assessment section, interpreting the findings. As these sections describe medical aspects observed in one or more related images (Findings) and conclusions drawn from them (Assessment), they provide supervision for identifying relevant patterns in the images and interpretations of these patterns. Both sections can be split into sentences, and each of these sentences typically describes one or a few aspects, most of which we assume to be related to one or a few very localized regions in a paired image. We randomly sample one of the images related to a given report and split it into \(7 \times 7\) equally-sized regions. More precisely, we augment and resize the image to a size of \(224 \times 224\), feed it into a convolutional neural network, and use the output feature map of size \(7 \times 7\) as region representations. A language model encodes the tokens of the report as contextualized (i.e. considering their meaning in the whole report) vector representations, from which we compute sentence representations. A many-to-many alignment model is then used to compute cross-modal representations from uni-modal representations, i.e. image region representations from sentence representations and vice-versa. We argue that by aligning cross-modal and uni-modal representations, the image region representations are encouraged to contain the high-level semantics present in the report.

Fig. 1. Example radiology report describing chest X-rays. Taken from the MIMIC-CXR [13, 23, 24] dataset.

3.2 Model Overview

Figure 2 shows the general architecture of our proposed LoVT model. Each training sample \(\boldsymbol{x}_i\) is a pair of an image \(\boldsymbol{x}^{\mathcal {I}}_i\in \mathbb {R}^{224 \times 224}\) and the related report \(\boldsymbol{x}^{\mathcal {R}}_i\) consisting of \(M_i\) sentences. Both \(\boldsymbol{x}^{\mathcal {I}}_i\) and \(\boldsymbol{x}^{\mathcal {R}}_i\) are encoded independently into two global representations (one for the image and one for the report) and multiple local representations per sample (corresponding to image regions and report sentences, respectively). An attention-based alignment model then computes cross-modal representations (i.e. sentence representations from image regions and vice-versa), which are aligned with the local uni-modal representations using local contrastive losses. Additionally, the global representations are aligned using a global contrastive loss. The encoders and the alignment model are trained jointly on batches of image-report pairs \(\boldsymbol{x}_i\). The details of the model and the loss function are described in the following sections.

Fig. 2. Architecture of LoVT. Given an image \(\boldsymbol{x}^{\mathcal {I}}_i\) and the related report \(\boldsymbol{x}^{\mathcal {R}}_i\), the encoders \(E^{\mathcal {I}}\) and \(E^{\mathcal {R}}\) compute image region and report sentence representations, respectively, which are projected using \(f^{\mathcal {I}}\) and \(f^{\mathcal {R}}\). The alignment models \(A^{\mathcal {R}\rightarrow \mathcal {I}}\) and \(A^{\mathcal {I}\rightarrow \mathcal {R}}\) compute cross-modal report-to-image (\(\boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k}\)) and image-to-report (\(\boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}\)) representations, which are aligned with the uni-modal representations (\(\boldsymbol{z}^{\mathcal {I}}_{i,k}\) and \(\boldsymbol{z}^{\mathcal {R}}_{i,m}\)) using the local losses \(\mathcal {L}_\text {local-image}\) and \(\mathcal {L}_\text {local-report}\), respectively. Global image (\(\bar{\boldsymbol{y}}^{\mathcal {I}}_{i}\)) and report (\(\bar{\boldsymbol{y}}^{\mathcal {R}}_{i}\)) representations are computed using attention pooling on the local representations, then projected using \(\bar{f}^{\mathcal {I}}\) and \(\bar{f}^{\mathcal {R}}\) and aligned using the global loss \(\mathcal {L}_\text {global}\).

3.3 Encoding

Each image \(\boldsymbol{x}^{\mathcal {I}}_i\) is encoded into \(K = H \times W\) (we use \(K = 7 \times 7\)) region representations \(\boldsymbol{y}^{\mathcal {I}}_{i,k}\in \mathbb {R}^{d^{\mathcal {I}}}\) using the image encoder \(E^{\mathcal {I}}\), where k is the index of the image region, and \(d^{\mathcal {I}}\) is the dimension of the image region representation space. Our approach is encoder agnostic, i.e. any model encoding image regions into vector representations can be used for \(E^{\mathcal {I}}\). We use a ResNet50 [16] and take the feature map before global average pooling as region representations. Similarly, each report \(\boldsymbol{x}^{\mathcal {R}}_i\) is encoded into \(M_i\) sentence representations \(\boldsymbol{y}^{\mathcal {R}}_{i,m}\in \mathbb {R}^{d^{\mathcal {R}}}\) using the report encoder \(E^{\mathcal {R}}\). Here \(M_i\) is the number of sentences of report sample i, m is the index of the sentence, and \(d^{\mathcal {R}}\) is the dimension of the report sentence representation space. Note that while K is constant, \(M_i\) may be different for each sample. Any model encoding sentences into vector representations can be used for \(E^{\mathcal {R}}\). We use BERT_base [10] to jointly encode the tokens of the concatenated sentences of each report and then perform max pooling over the token representations of each sentence to get sentence representations.
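
As a concrete illustration of this encoding step, the following PyTorch-style sketch shows how region and sentence representations could be obtained. It is a minimal sketch under our own assumptions (a torchvision ResNet50 and a Hugging Face BERT), not the reference implementation, and for brevity it encodes each sentence separately instead of jointly encoding the concatenated report as described above.

```python
# Minimal sketch of the encoding step (Sect. 3.3); module names are illustrative and
# this is not the reference implementation. For brevity the sentences are encoded
# separately here instead of jointly encoding the concatenated report as in LoVT.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizerFast


class ImageRegionEncoder(nn.Module):
    """ResNet50 without pooling/classification head; the 7x7 feature map yields K=49 regions."""

    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc

    def forward(self, images):                   # images: (B, 3, 224, 224)
        fmap = self.stem(images)                 # (B, 2048, 7, 7)
        return fmap.flatten(2).transpose(1, 2)   # (B, K=49, 2048) region representations y^I


def encode_sentences(sentences, bert, tokenizer, device="cpu"):
    """Encode report sentences with BERT and max-pool the token representations
    of each sentence to obtain sentence representations y^R."""
    enc = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
    tokens = bert(**enc).last_hidden_state                  # (M, T, 768)
    mask = enc["attention_mask"].unsqueeze(-1).bool()
    tokens = tokens.masked_fill(~mask, float("-inf"))       # ignore padding tokens in the max
    return tokens.max(dim=1).values                         # (M, 768) sentence representations
```

For a report with \(M_i\) sentences this yields an \((M_i, 768)\) matrix of sentence representations, while the image encoder yields a \((K, 2048)\) matrix per image.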

The global (i.e. per-sample) representations \(\bar{\boldsymbol{y}}^{\mathcal {I}}_{i}\) and \(\bar{\boldsymbol{y}}^{\mathcal {R}}_{i}\) are each computed by an attention pooling layer (not shared between modalities) on the region and sentence representations, respectively. It is implemented using multi-head query-key-value attention [44] where the query is computed from the globally averaged region or sentence representations. This pooling approach was first proposed for the image encoder of CLIP [33].

Following previous works [6, 14, 51], we compute projected local representations \(\boldsymbol{z}^{\mathcal {I}}_{i,k}\in \mathbb {R}^{d^{\mathcal {Z}}}\) and \(\boldsymbol{z}^{\mathcal {R}}_{i,m}\in \mathbb {R}^{d^{\mathcal {Z}}}\), and projected global representations \(\bar{\boldsymbol{z}}^{\mathcal {I}}_{i}\in \mathbb {R}^{\bar{d}^{\mathcal {Z}}}\) and \(\bar{\boldsymbol{z}}^{\mathcal {R}}_{i}\in \mathbb {R}^{\bar{d}^{\mathcal {Z}}}\) from the representations \(\boldsymbol{y}^{\mathcal {I}}_{i,k}\), \(\boldsymbol{y}^{\mathcal {R}}_{i,m}\), \(\bar{\boldsymbol{y}}^{\mathcal {I}}_{i}\), and \(\bar{\boldsymbol{y}}^{\mathcal {R}}_{i}\), using the (non-shared) nonlinear transformations \(f^{\mathcal {I}}\), \(f^{\mathcal {R}}\), \(\bar{f}^{\mathcal {I}}\), and \(\bar{f}^{\mathcal {R}}\), respectively, where \(d^{\mathcal {Z}}\) is the dimension of the shared local and \(\bar{d}^{\mathcal {Z}}\) of the shared global representation space (we use 512 for both). Note that for local representations the projections are applied to each region k or sentence m independently.
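
The attention pooling and projection steps could be sketched as follows. This is a simplified illustration: CLIP's original attention pooling additionally uses positional embeddings and appends the mean as an extra token, both omitted here, and all layer sizes of the projection heads are our assumptions.

```python
# Simplified sketch of attention pooling and the projection heads (not the authors' code).
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Pool local representations into one global vector. The query is the mean of the
    local representations; the attention probabilities (averaged over heads) can later
    serve as the region/sentence weights used in the local losses (Sect. 3.5)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, local_repr, key_padding_mask=None):    # local_repr: (B, K, d)
        query = local_repr.mean(dim=1, keepdim=True)          # (B, 1, d)
        pooled, weights = self.attn(query, local_repr, local_repr,
                                    key_padding_mask=key_padding_mask)
        return pooled.squeeze(1), weights.squeeze(1)          # (B, d) global repr, (B, K) weights


def projection_head(d_in, d_out=512):
    """Non-linear projection (f, f-bar) into the shared representation space."""
    return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(inplace=True), nn.Linear(d_out, d_out))
```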

3.4 Alignment Model

Following our assumptions (see Sect. 3.1), we compute an alignment of image regions and sentences and compute cross-modal representations using the alignment models \(A^{\mathcal {I}\rightarrow \mathcal {R}}\) and \(A^{\mathcal {R}\rightarrow \mathcal {I}}\), which are based on single-head query-key-value attention [44].

For each sentence m, the cross-modal representation \(\boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}\) is computed by letting \(\boldsymbol{z}^{\mathcal {R}}_{i,m}\) attend to all image region representations \(\boldsymbol{z}^{\mathcal {I}}_{i,k}\) (of the related image). We therefore compute the probability \(\alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k}\) that sentence m is aligned with region k based on the scaled dot-product scores of their projected representations, i.e. \(\alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k}= \text {softmax}_{k}\left( \frac{(\boldsymbol{Q}\boldsymbol{z}^{\mathcal {R}}_{i,m})^T(\boldsymbol{Q}\boldsymbol{z}^{\mathcal {I}}_{i,k})}{\sqrt{d^{\mathcal {Z}}}}\right) \), where the linear query-key projection \(\boldsymbol{Q}\) is a learned matrix. The alignment model \(A^{\mathcal {I}\rightarrow \mathcal {R}}\) then uses \(\alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k}\) to compute \(\boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}\) as a projected weighted sum of the image region representations \(\boldsymbol{z}^{\mathcal {I}}_{i,k}\):

$$\begin{aligned} \begin{aligned} \boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}= \boldsymbol{O} \left( \sum _{k=1}^{K} \alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k}\left( \boldsymbol{V}\boldsymbol{z}^{\mathcal {I}}_{i,k}\right) \right) \,, \end{aligned} \end{aligned}$$
(1)

where the value projection \(\boldsymbol{V}\) and the output projection \(\boldsymbol{O}\) are learned matrices.

In a similar fashion the cross-modal representations \(\boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k}\) are computed by \(A^{\mathcal {R}\rightarrow \mathcal {I}}\):

$$\begin{aligned} \begin{aligned} \boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k}= \boldsymbol{O} \left( \sum _{m=1}^{M_i} \alpha ^{\mathcal {R}\rightarrow \mathcal {I}}_{i, k, m}\left( \boldsymbol{V}\boldsymbol{z}^{\mathcal {R}}_{i,m}\right) \right) \,, \end{aligned} \end{aligned}$$
(2)

with \(\alpha ^{\mathcal {R}\rightarrow \mathcal {I}}_{i, k, m}= \text {softmax}_{m}\left( \frac{(\boldsymbol{Q}\boldsymbol{z}^{\mathcal {I}}_{i,k})^T(\boldsymbol{Q}\boldsymbol{z}^{\mathcal {R}}_{i,m})}{\sqrt{d^{\mathcal {Z}}}}\right) \). Note that as \(A^{\mathcal {R}\rightarrow \mathcal {I}}\) and \(A^{\mathcal {I}\rightarrow \mathcal {R}}\) share the same matrices \(\boldsymbol{Q}\), \(\boldsymbol{V}\), and \(\boldsymbol{O}\), the only difference between \(\alpha ^{\mathcal {R}\rightarrow \mathcal {I}}_{i, k, m}\) and \(\alpha ^{\mathcal {I}\rightarrow \mathcal {R}}_{i, m, k}\) is transposition and the index over which softmax is applied.
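
Because \(\boldsymbol{Q}\), \(\boldsymbol{V}\), and \(\boldsymbol{O}\) are shared, both alignment directions can be written compactly. The sketch below transcribes Eqs. (1) and (2) for a padded batch; masking of padded sentences for variable \(M_i\) is omitted for brevity and all names are illustrative, so treat it as an illustration rather than the reference implementation.

```python
# Sketch of the alignment models A^{R->I} and A^{I->R} (Eqs. 1 and 2). Q, V, and O are
# shared between both directions; only the softmax axis differs.
import torch
import torch.nn as nn


class CrossModalAlignment(nn.Module):
    def __init__(self, d_z=512):
        super().__init__()
        self.Q = nn.Linear(d_z, d_z, bias=False)  # shared query/key projection
        self.V = nn.Linear(d_z, d_z, bias=False)  # shared value projection
        self.O = nn.Linear(d_z, d_z, bias=False)  # shared output projection
        self.scale = d_z ** 0.5

    def forward(self, z_img, z_rep):
        # z_img: (B, K, d_z) projected region representations
        # z_rep: (B, M, d_z) projected sentence representations
        scores = self.Q(z_img) @ self.Q(z_rep).transpose(1, 2) / self.scale  # (B, K, M)
        alpha_r2i = scores.softmax(dim=2)                   # region k attends over sentences m
        alpha_i2r = scores.softmax(dim=1).transpose(1, 2)   # sentence m attends over regions k: (B, M, K)
        z_r2i = self.O(alpha_r2i @ self.V(z_rep))           # (B, K, d_z), Eq. (2)
        z_i2r = self.O(alpha_i2r @ self.V(z_img))           # (B, M, d_z), Eq. (1)
        return z_r2i, z_i2r
```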

3.5 Loss Function

Global Alignment. For global alignment we follow ConVIRT [51] and maximize the cosine similarity between paired image and report representations while minimizing the similarity between non-paired representations (i.e. from different samples). The loss consists of an image-report part, where all non-paired report representations from the batch are used as negatives:

$$\begin{aligned} \ell ^{\mathcal {I}\Vert \mathcal {R}}_\text {global}&=- \log \frac{ e^{\cos \left( \bar{\boldsymbol{z}}^{\mathcal {I}}_{i}, \bar{\boldsymbol{z}}^{\mathcal {R}}_{i}\right) / \tau } }{ \sum _{j} e^{\cos \left( \bar{\boldsymbol{z}}^{\mathcal {I}}_{i}, \bar{\boldsymbol{z}}^{\mathcal {R}}_{j}\right) / \tau }}\,, \end{aligned}$$
(3)

and a report-image part, defined analogously:

$$\begin{aligned} \ell ^{\mathcal {R}\Vert \mathcal {I}}_\text {global}&=- \log \frac{ e^{\cos \left( \bar{\boldsymbol{z}}^{\mathcal {R}}_{i}, \bar{\boldsymbol{z}}^{\mathcal {I}}_{i}\right) / \tau } }{ \sum _{j} e^{\cos \left( \bar{\boldsymbol{z}}^{\mathcal {R}}_{i}, \bar{\boldsymbol{z}}^{\mathcal {I}}_{j}\right) / \tau }}\,, \end{aligned}$$
(4)

where \(\tau \) is the similarity temperature (we use 0.1) and all logarithms are natural. Both parts are combined using the hyperparameter \(\lambda \in [0, 1]\) (we use 0.75):

$$\begin{aligned} \mathcal {L}_\text {global}&= \frac{1}{N}\sum _{i=1}^N \left[ \lambda \cdot \ell ^{\mathcal {I}\Vert \mathcal {R}}_\text {global} + (1 - \lambda )\cdot \ell ^{\mathcal {R}\Vert \mathcal {I}}_\text {global}\right] \,. \end{aligned}$$
(5)

Local Alignment. The global alignment loss not only aligns the global representations but also prevents them from collapsing to a constant vector by using negative samples to contrast the positive pairs. Similarly, we propose local alignment losses encouraging spatial (sentence) sensitivity through negatives from the same sample, i.e. preventing the local representations from being similar for all regions (sentences) of an image (report). We use two NT-Xent-based [6] local losses: \(\mathcal {L}_\text {local-image}\), aligning region representations \(\boldsymbol{z}^{\mathcal {I}}_{i,k}\) with \(\boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k}\), and \(\mathcal {L}_\text {local-report}\), aligning sentence representations \(\boldsymbol{z}^{\mathcal {R}}_{i,m}\) with \(\boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}\).

Some regions or sentences may not be relevant for aligning a sample (e.g. background regions or sentences not related to the image). Therefore, we introduce region weights \(w^{\mathcal {I}}_{i, k}\) and sentence weights \(w^{\mathcal {R}}_{i, m}\), which are computed as the attention probabilities from the respective attention pooling layer (which was used to compute global representations), averaged over all attention heads. These weights are used in the local loss functions such that irrelevant representations do not have to be aligned. Note that we do not backpropagate through the region or sentence weights.

The loss \(\mathcal {L}_\text {local-image}\) allows for multiple positive pairs within each sample by giving each pair of regions (k, l) a positiveness probability \(p^{\mathcal {I}}_{k, l}\in [0, 1]\). We then treat each positive pair as its own (weighted) example and contrast it with all other pairs (again, all logarithms are natural):

$$\begin{aligned} \ell ^{\mathcal {I}\Vert \mathcal {R}\rightarrow \mathcal {I}}_\text {local-image} = -\sum _{l=1}^K p^{\mathcal {I}}_{k, l}\log \frac{ e^{\cos \left( \boldsymbol{z}^{\mathcal {I}}_{i,k}, \boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,l}\right) / \tau '} }{ \sum _{k'} e^{\cos \left( \boldsymbol{z}^{\mathcal {I}}_{i,k}, \boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k'}\right) / \tau '}} \end{aligned}$$
(6)
$$\begin{aligned} \ell ^{\mathcal {R}\rightarrow \mathcal {I}\Vert \mathcal {I}}_\text {local-image} = -\sum _{l=1}^K p^{\mathcal {I}}_{k, l}\log \frac{ e^{\cos \left( \boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k}, \boldsymbol{z}^{\mathcal {I}}_{i,l}\right) / \tau '} }{ \sum _{k'} e^{\cos \left( \boldsymbol{z}^{\mathcal {R}\rightarrow \mathcal {I}}_{i,k}, \boldsymbol{z}^{\mathcal {I}}_{i,k'}\right) / \tau '}} \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_\text {local-image}&= \frac{1}{2N}\sum _{i=1}^N \sum _{k = 1}^{K} w^{\mathcal {I}}_{i, k}\cdot \left[ \ell ^{\mathcal {I}\Vert \mathcal {R}\rightarrow \mathcal {I}}_\text {local-image} + \ell ^{\mathcal {R}\rightarrow \mathcal {I}\Vert \mathcal {I}}_\text {local-image}\right] \,. \end{aligned}$$
(8)

Here \(\tau '\) is the similarity temperature and is set to 0.3. We assume that nearby image regions are often similar and that therefore nearby regions are more likely to be positives while distant regions are more likely to be negatives. Thus, we define the positiveness probability \(p^{\mathcal {I}}_{k, l}\) of two image regions as the complementary cumulative exponential distribution of \(d_{\boldsymbol{x}}\) (their spatial \(\ell _2\)-distance in 2D space normalized by the length of the diagonal \(\sqrt{H^2 + W^2}\)) and set \(p^{\mathcal {I}}_{k, l}\) to zero above cutoff threshold \(T \in [0, \infty )\):

$$\begin{aligned} p^{\mathcal {I}}_{k, l}= \frac{ {\mathbb {1}}_{[d_{\boldsymbol{x}}(k, l) \le T]} \cdot e^{-d_{\boldsymbol{x}}(k, l) / \beta } }{ \sum _{k'} {\mathbb {1}}_{[d_{\boldsymbol{x}}(k, k') \le T]} \cdot e^{-d_{\boldsymbol{x}}(k, k') / \beta } }\,. \end{aligned}$$
(9)

Here \(\beta \in (0, \infty )\) is a sharpness hyperparameter. We set \(\beta =1\) and \(T = 0.5\). Note that the normalization of \(d_{\boldsymbol{x}}\) is equal to rescaling T and \(\beta \), i.e. it allows us to define both hyperparameters independently of the image size.

The definition of \(p^{\mathcal {I}}_{k, l}\) is derived by modeling the occurrence of related features at specific distances in the image as a Poisson point process, such that the \(\ell _2\)-distance of related features follows the exponential distribution. We assume a Poisson process due to its property of being memoryless, i.e. knowing that a feature is already related to another feature at some distance does not change how distant additional related features can be found. Also, the probability density function of the exponential distribution is decreasing (with support on the interval \([0, \infty )\)), which seems reasonable as it is typically more likely that related features are near than far. Its cumulative distribution function then describes the probability that two related features lie within a given radius, and its complementary function the probability of lying outside a given radius. The threshold T ensures that very distant pairs do not count as positives. The loss \(\mathcal {L}_\text {local-image}\) thus encourages spatial smoothness of image regions while maintaining spatial sensitivity through negative samples. Note that it is related to the pixel-contrast loss proposed in [49]; the main novelty of our work is the partly smooth definition of \(p^{\mathcal {I}}_{k, l}\) based on the exponential distribution.
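
The following sketch transcribes the positiveness probabilities of Eq. (9) and the weighted local image loss of Eqs. (6)-(8) for a single sample. It follows the formulas directly, so treat it as an illustration under our assumptions rather than the reference implementation.

```python
# Sketch of p^I_{k,l} (Eq. 9) and the weighted local image loss (Eqs. 6-8), per sample.
import torch
import torch.nn.functional as F


def positiveness(H=7, W=7, beta=1.0, T=0.5):
    """Returns the (K, K) matrix p^I_{k,l}; rows are normalized as in Eq. (9)."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()     # (K, 2) region coordinates
    d = torch.cdist(coords, coords) / (H ** 2 + W ** 2) ** 0.5        # normalized l2 distance d_x
    p = torch.exp(-d / beta) * (d <= T)                               # truncated exponential
    return p / p.sum(dim=1, keepdim=True)


def local_image_loss(z_img, z_r2i, w_img, p, tau_prime=0.3):
    """z_img, z_r2i: (K, d) region and report-to-image representations of one sample;
    w_img: (K,) region weights from attention pooling; p: (K, K) positiveness."""
    sim = F.normalize(z_img, dim=1) @ F.normalize(z_r2i, dim=1).T / tau_prime  # (K, K)
    log_a = sim.log_softmax(dim=1)    # row k: z^I_k contrasted against all z^{R->I}_{k'}  (Eq. 6)
    log_b = sim.T.log_softmax(dim=1)  # row k: z^{R->I}_k contrasted against all z^I_{k'}  (Eq. 7)
    ell = -(p * log_a).sum(dim=1) - (p * log_b).sum(dim=1)
    return 0.5 * (w_img * ell).sum()  # per-sample contribution to Eq. (8); average over the batch for L
```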

The local report loss \(\mathcal {L}_\text {local-report}\) is defined similarly but we do not assume prior knowledge about the similarity of sentences and therefore only have a single positive pair per sentence (again all logarithms are natural):

$$\begin{aligned} \ell ^{\mathcal {R}\Vert \mathcal {I}\rightarrow \mathcal {R}}_\text {local-report} = -\log \frac{ e^{\cos \left( \boldsymbol{z}^{\mathcal {R}}_{i,m}, \boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}\right) / \tau '} }{ \sum _{m'} e^{\cos \left( \boldsymbol{z}^{\mathcal {R}}_{i,m}, \boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m'}\right) / \tau '}} \end{aligned}$$
(10)
$$\begin{aligned} \ell ^{\mathcal {I}\rightarrow \mathcal {R}\Vert \mathcal {R}}_\text {local-report} = -\log \frac{ e^{\cos \left( \boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}, \boldsymbol{z}^{\mathcal {R}}_{i,m}\right) / \tau '} }{ \sum _{m'} e^{\cos \left( \boldsymbol{z}^{\mathcal {I}\rightarrow \mathcal {R}}_{i,m}, \boldsymbol{z}^{\mathcal {R}}_{i,m'}\right) / \tau '}} \end{aligned}$$
(11)
$$\begin{aligned} \mathcal {L}_\text {local-report}&= \frac{1}{2N}\sum _{i=1}^N \sum _{m = 1}^{M_i} w^{\mathcal {R}}_{i, m}\cdot \left[ \ell ^{\mathcal {R}\Vert \mathcal {I}\rightarrow \mathcal {R}}_\text {local-report} + \ell ^{\mathcal {I}\rightarrow \mathcal {R}\Vert \mathcal {R}}_\text {local-report}\right] \, \end{aligned}$$
(12)

Total Loss. The total loss \(\mathcal {L}\) is computed as the weighted sum of global and local losses:

$$\begin{aligned} \mathcal {L}&= \gamma \cdot \mathcal {L}_\text {global} + \mu \cdot \mathcal {L}_\text {local-image} + \nu \cdot \mathcal {L}_\text {local-report}\,, \end{aligned}$$
(13)

where \(\gamma \), \(\mu \), and \(\nu \) are loss weights to balance the individual losses and are set to 1.0, 0.75, and 0.75, respectively. We determined these loss weights by running small grid searches (see supplementary material for details).
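
For completeness, the local report loss of Eqs. (10)-(12) and the weighted combination of Eq. (13) can be sketched analogously, again per sample and with simplified weight handling; names are illustrative and this is not the reference implementation.

```python
# Sketch of the local report loss (Eqs. 10-12) and the total loss (Eq. 13), per sample.
import torch
import torch.nn.functional as F


def local_report_loss(z_rep, z_i2r, w_rep, tau_prime=0.3):
    """z_rep, z_i2r: (M, d) sentence and image-to-report representations of one sample;
    w_rep: (M,) sentence weights from attention pooling."""
    sim = F.normalize(z_rep, dim=1) @ F.normalize(z_i2r, dim=1).T / tau_prime  # (M, M)
    targets = torch.arange(sim.size(0), device=sim.device)   # single positive per sentence
    ell = F.cross_entropy(sim, targets, reduction="none") \
        + F.cross_entropy(sim.T, targets, reduction="none")  # Eqs. (10) and (11)
    return 0.5 * (w_rep * ell).sum()                          # per-sample contribution to Eq. (12)


def total_loss(l_global, l_local_image, l_local_report, gamma=1.0, mu=0.75, nu=0.75):
    return gamma * l_global + mu * l_local_image + nu * l_local_report  # Eq. (13)
```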

4 Evaluation

4.1 Downstream Tasks and Experimental Setup

We evaluate our method on a downstream evaluation framework [30] with 18 localized tasks on chest X-rays, which we will shortly describe here. For more details, we refer to the supplementary material.

Evaluation Protocols. We only evaluate the pre-trained ResNet50 (from the image encoder). For semantic segmentation tasks we use the following evaluation protocols: (i) U-Net Finetune, where the ResNet50 is used as the backbone of a U-Net [35] and is finetuned jointly with all other layers, (ii) U-Net Frozen, where the ResNet50 is used as the frozen backbone of a U-Net [35] and only the non-backbone layers are trained, and (iii) Linear, where an element-wise linear layer applied to the feature map of the frozen ResNet50 is trained and its results are upsampled to the segmentation resolution.

For object detection tasks we use the following protocols: (i) YOLOv3 Finetune, where the ResNet50 is used as the backbone of a YOLOv3 [34] model and is finetuned jointly with all other layers, (ii) YOLOv3 Frozen, where the ResNet50 is used as the frozen backbone of a YOLOv3 [34] model and only the non-backbone layers are trained, and (iii) Linear, where the object detection ground truth is converted to segmentation masks and the Linear evaluation protocol is applied.
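
As an illustration of the Linear protocol shared by both task types, the sketch below trains a single \(1 \times 1\) convolution (an element-wise linear layer) on the frozen ResNet50 feature map and bilinearly upsamples the logits to the target resolution. The backbone interface, upsampling mode, and all names are our assumptions, not the evaluation framework's code.

```python
# Minimal sketch of the Linear evaluation protocol described above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSegmentationProbe(nn.Module):
    def __init__(self, frozen_backbone, num_classes, out_size=224):
        super().__init__()
        self.backbone = frozen_backbone            # e.g. a ResNet50 stem returning (B, 2048, 7, 7)
        for p in self.backbone.parameters():
            p.requires_grad = False                # only the linear layer below is trained
        self.head = nn.Conv2d(2048, num_classes, kernel_size=1)  # element-wise linear layer
        self.out_size = out_size

    def forward(self, images):                     # images: (B, 3, 224, 224)
        with torch.no_grad():
            fmap = self.backbone(images)           # frozen feature map
        logits = self.head(fmap)                   # (B, num_classes, 7, 7)
        return F.interpolate(logits, size=self.out_size, mode="bilinear", align_corners=False)
```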

Downstream Datasets. We evaluate the pre-trained ResNet50 on several medical datasets, namely (i) RSNA Pneumonia Detection [39, 45], with more than 26000 frontal-view chest X-rays with detection targets for pneumonia opacities. We use the YOLOv3 Finetune, YOLOv3 Frozen, and Linear protocols, each with \(1 \%\), \(10 \%\), and \(100 \%\) of the training samples; (ii) COVID Rural [9, 43], with more than 200 frontal-view chest X-rays with segmentation masks for COVID-19 lung opacity regions. We use the U-Net Finetune, U-Net Frozen, and Linear protocols; (iii) SIIM-ACR Pneumothorax Segmentation [40], with more than 12000 frontal-view chest X-rays with segmentation masks for pneumothorax. We use the U-Net Finetune and U-Net Frozen protocols but do not use Linear due to the fine-grained nature of the segmentation masks; (iv) Object CXR [20], with 9000 frontal-view chest X-rays with detection targets for foreign objects. We use the YOLOv3 Finetune, YOLOv3 Frozen, and Linear protocols; (v) NIH CXR [45], with almost 1000 frontal-view chest X-rays with detection targets for eight pathologies (Atelectasis, Cardiomegaly, Effusion, Infiltrate, Mass, Nodule, Pneumonia, and Pneumothorax). Due to the limited data per class, we only use the Linear protocol.

The different evaluation protocols are complementary: the U-Net Finetune and YOLOv3 Finetune protocols evaluate how well suited the pre-trained image models are for fine-tuning, as used in practical applications, while the Linear protocols evaluate the quality of the learned local representations (i.e. feature maps) while adding only a few parameters and therefore largely avoiding the variance introduced by random initialization during downstream evaluation. The U-Net Frozen and YOLOv3 Frozen protocols are a trade-off, where representations are frozen but evaluated in a more practical setting (with several randomly initialized layers).

Tuning and Evaluation Procedure. All baselines and our models have been tuned only on a single downstream task, RSNA YOLOv3 Frozen 10%, where a single fixed downstream learning rate was used (determined in preliminary experiments) and the results of five runs have been averaged. Other downstream tasks have not been evaluated during tuning to make sure that models are not biased towards the downstream tasks. After tuning, we evaluated each model on all downstream tasks: The learning rates were tuned individually per model and task (using single evaluation runs) before running five evaluations per task (all using the tuned learning rate). We report the average results of these five runs and their \(95\%\)-confidence interval (where each evaluation run is considered a sample).

Pre-Training Dataset. We train our method on MIMIC-CXR [13, 22, 23, 24] (version 2) as, to the best of our knowledge, it is the largest and most commonly used dataset of this kind. Since all downstream tasks contain only frontal views, we remove all lateral views, such that roughly 210000 training samples remain, each with a report and one or more frontal images.

Baselines. We compare our method against several baseline methods:

  • Random Init.: The ResNet50 is initialized using its default random initialization;

  • ImageNet [36] Init.: The ResNet50 is initialized with weights pre-trained on the ImageNet ILSVRC-2012 task [36];

  • CheXpert [19]: The ResNet50 is pre-trained using supervised multi-label binary classification with CheXpert [19] labels on frontal chest X-rays of MIMIC-CXR;

  • Global image pre-training methods: The ResNet50 is pre-trained using the self-supervised pre-training methods SimCLR [6] or BYOL [14] on frontal chest X-rays of MIMIC-CXR. We decided to include SimCLR as it uses a similar loss function to LoVT, and we include BYOL because of its widespread use.

  • Local image pre-training methods: The ResNet50 is pre-trained using the self-supervised pre-training method PixelPro [49] on frontal chest X-rays of MIMIC-CXR. We include PixelPro to study the effect of local contrastive losses when using only images.

  • Global image-text pre-training methods: The ResNet50 is pre-trained using the image-text methods ConVIRT [51] or CLIP [33] on frontal chest X-rays of MIMIC-CXR. Note that for comparability we adapted CLIP to use the same image and text encoders as ConVIRT, such that the main difference between CLIP and ConVIRT is that CLIP uses attention pooling to compute the image representation while ConVIRT uses average pooling. We include both methods as LoVT builds upon a similar general framework: we include ConVIRT because it targets chest X-rays (like LoVT) and CLIP because of its widespread use and because it (like LoVT) uses attention pooling in the image encoder. We decided not to include VirTex [8] and ICMLM [38] as they use generative tasks, which have been found to be less effective for discriminative downstream tasks [33].

4.2 Downstream Results

Table 1. Results on the RSNA pneumonia detection tasks with different training set sizes. All results are averaged over five evaluation runs and the 95%-confidence interval is shown. The best results per task are underlined, the second-best results are dash-underlined and the best results per pre-training category (general initialization, pre-training on 30% and 100%) are highlighted in bold. Note that the YOLOv3 Frozen 10% task (task 5) was used for tuning of all methods and may therefore not be representative as methods may overfit on this task.
Table 2. Results on downstream tasks on the COVID Rural, SIIM Pneumothorax, Object CXR, and NIH CXR datasets. All results are averaged over five evaluation runs and the 95%-confidence interval is shown. The best results per task are underlined, the second-best results are dash-underlined and the best results per pre-training category (general initialization, pre-training on 30% and 100%) are highlighted in bold.

We present the downstream results of our model LoVT and the baselines, with pre-training on 100% and 30% of MIMIC-CXR. Table 1 shows the results on different subsets of the RSNA dataset and Table 2 shows the results on the remaining downstream datasets, i.e. COVID Rural, SIIM-ACR Pneumothorax, Object CXR, and NIH CXR.

Comparison of Methods. We found that there is no single pre-training method performing best on all evaluated downstream tasks. On most tasks (15 out of 18), image-text self-supervised methods (i.e. LoVT, CLIP, or ConVIRT) outperform the other methods, so they should be preferred if paired text is available.

Fig. 3. Spatial smoothness and sensitivity of image region representations. Left: LoVT (ours). Middle: no local losses. Right: no local losses and no attention pooling. Cosine similarities of image region pairs \(\boldsymbol{y}^{\mathcal {I}}_{i,k}, \boldsymbol{y}^{\mathcal {I}}_{i,k'}\) (each from the same sample) plotted as violin plots (with their width representing the number of pairs and quartiles shown as dashed lines) over their spatial distance in the \(7 \times 7\) image space (normalized and rounded to one decimal digit). We trained all models on 30% of the data and computed the representations on the test set.

Our model LoVT is the best method (over all pre-training settings) on 10 of 18 tasks and significantly outperforms all other methods on 6 of these tasks, while the second-best method, CLIP, significantly outperforms all other methods on only 2 tasks. LoVT outperforms image-only methods (i.e. BYOL, SimCLR, and PixelPro) on 14 tasks, and the localized image-only method PixelPro outperforms LoVT on only one task (task 15). On 11 tasks LoVT outperforms the other text-supervised methods (i.e. ConVIRT and CLIP), on 14 tasks it outperforms CheXpert classification, and on all but two tasks it outperforms ImageNet initialization. When using 100% of the pre-training data, LoVT is the best pre-training method on 11 tasks (better by at least the confidence interval on 5 tasks), and when using 30% it is the best on 11 tasks (significantly the best on 4 tasks). LoVT performs best on all COVID Rural tasks, best on most Linear tasks, and quite well under the Frozen protocols, but does not perform well on the NIH CXR dataset or when finetuned on the RSNA dataset. Since no single method performs best on all tasks and LoVT performs best on the majority of them, LoVT is the default method of choice for localized downstream tasks.

Relevance of Pre-Training Dataset Size. We do not observe a consistent benefit of using roughly 210000 pre-training samples (i.e. 100% of the data) over using roughly 63000 samples (i.e. 30%). While on some datasets like RSNA and Object CXR many methods often perform better when pre-trained on 210000 samples (100%), on other datasets like COVID Rural, methods often perform better when pre-trained on 63000 samples (30%). When comparing LoVT pre-trained on 30% of the data with other methods pre-trained in both settings (i.e. 30% and 100%), we observe that LoVT outperforms image-only methods (i.e. BYOL, SimCLR, and PixelPro) on 12 tasks, other text-supervised methods (i.e. ConVIRT and CLIP) on 7 tasks and CheXpert classification on 12 tasks, showing that LoVT effectively reduces the number of required pre-training samples.

Relevance of Downstream Dataset Size. The results shown in Table 1 suggest that, as expected, larger downstream training sets lead to better results. However, we observe that for text-supervised methods (i.e. LoVT, CLIP, and ConVIRT), the downstream training set size is often less relevant compared to other methods. On the RSNA YOLOv3 Frozen tasks, LoVT (100%) outperforms ImageNet initialization by 31% when using 100% of the downstream samples, while it outperforms ImageNet initialization by as much as 167% when using only 1% of the samples.

Spatial Smoothness and Sensitivity. We analyze the influence of the local losses and attention pooling on the spatial smoothness and sensitivity of image region representations and therefore plot in Fig. 3 the distributions of the cosine similarity of image region pairs over their spatial distances. For our LoVT model, spatial smoothness and sensitivity can be observed as the quartiles and extreme points of the cosine similarity distributions decrease monotonically with increasing spatial distance, except for a few very distant region pairs with distances larger than 0.6. Note that these spatially very distant region pairs most likely represent opposite borders (or corners) of the image, such that both very likely contain background, which explains why they have more similar representations. Without the local losses \(\mathcal {L}_\text {local-image}\) and \(\mathcal {L}_\text {local-report}\), the quartiles and extreme points decrease only for small spatial distances while increasing again for regions further apart, showing that spatial smoothness is only present for nearby regions and that spatial sensitivity of more distant regions is not optimal. When additionally replacing attention pooling with average (for image regions) and max (for sentences) pooling, similar results can be observed, except that the quartiles decrease faster and the maximum points do not decrease for nearby regions. We can therefore deduce that the local losses effectively encourage spatial smoothness and sensitivity, while attention pooling alone has only a small effect.

Analysis of LoVT and Ablation Study. We refer to the supplementary material for a detailed analysis of our method LoVT, including an ablation study (focusing on local weighting, global and local losses, and attention pooling), an analysis of the distribution and alignment of learned representations, and an analysis of the region weights \(w^{\mathcal {I}}_{i, k}\).

5 Discussion

Limitations of Our Evaluation Procedure. In the evaluation procedure, we did not apply extensive hyperparameter tuning, resized all inputs to a resolution of only \(224 \times 224\), and applied no data augmentation. The presented downstream results are therefore below results typically reported on these datasets. We followed [30] and kept the evaluation procedure simple to limit computational resources and avoid bias induced by tuning to allow for a fair comparison of our method with the baselines.

Limitations of LoVT. LoVT learns its alignment model implicitly based only on latent representations and instance-level pairing information. This makes the model sensitive to hyperparameters and hard to train. Also, it only uses local negatives from the same sample which restricts the number of negatives and may therefore limit its performance. Additionally, the alignment model is restricted to a simple attention mechanism and the regions are based on fixed patches that are not adaptive to the contents of the image. This may restrict the capabilities of the model and therefore of the pre-training method. For a detailed discussion of these limitations as well as of the potential negative societal impact we refer to the supplementary material.

Conclusion. We study pre-training for localized medical imaging tasks on chest X-rays and propose a novel text-supervised method called LoVT, which combines instance-level contrastive learning with local contrastive learning. We evaluate our method on 18 localized tasks on chest X-rays and compare it with commonly used pre-training and initialization methods. While there is no single best method for all tasks, our method LoVT is the best method on 10 out of 18 studied tasks, making it the method of choice for localized tasks.

We hope that our work provides valuable insights that encourage using pre-training for localized medical imaging and that our method inspires future work on localized text-supervised pre-training.