
1 Introduction

The promise of automated deep learning systems to assist radiologists is enormous. Important milestones, such as improved consistency and in some cases better performance, have already been reached on a growing number of use-cases [18, 37]. One source of inspiration for further progress is the way humans register and analyze images, a strategy that has proven effective for deep learning in the past [17, 37].

In any analysis, a doctor provides the memory and knowledge to place what is currently seen in the context of what has been seen before. In principle this is comparable to what happens implicitly, at scale, in any deep learning method. A doctor’s analysis is not implicit, though: the reasoning process can be described and verified. We ask whether (medical) deep learning methods could benefit from such an explicit infusion of memory and knowledge.

Making deep learning methods more explicit in their use of past observations has already been studied in Natural Language Processing (NLP) in the form of retrieval augmentation [14, 21]. Supplementing the input with relevant retrieved information can lead to performance gains [4]. This process can be seen as acting both as enrichment and as regularization. A further benefit of retrieval augmentation is that the supplemented context comes from a trusted knowledge source [13, 29]. The versatility of retrieval augmentation, which essentially provides a non-parametric memory expansion, is gaining traction in the multi-modal field [4, 28].

The modalities in multi-modal data typically have different strengths, leading to a strong and a weak modality [37]. For instance, radiology reports generally contain richer and more complete information than X-rays, since the report is essentially a clinician’s annotation [24]. With retrieval augmentation, information can be transferred explicitly from the strong to the weak modality.

One reason retrieval augmentation has not yet been adopted for medical applications is the weakness of retrieval methods in the medical domain. Retrieval in the general domain is focused on global image regions [8, 16], whereas in medical images global features, such as body and organ structure, are similar across patients. The more fine-grained aspects are the discriminative disease indicators, but they are easily overlooked. This need for fine-grained results makes medical image retrieval considerably more complex.

We propose X-Ray Task Retrieval Augmentation (X-TRA), a framework for retrieval augmentation in a multi-modal medical setting, specifically designed for X-ray and radiology report analysis. To do so we introduce a cross-modal retrieval model and retrieval augmentation method. We make the following contributions.

  • We propose a CLIP-based multi-modal retrieval framework with a dedicated fine-tuning component for efficient content alignment of medical information which improves state-of-the-art results in multi- and single-modal retrieval on radiology images and reports.

  • We introduce a multi-modal retrieval augmentation component for disease classification and report retrieval pipelines.

  • We show that (1) our method reaches state-of-the-art performance in both multi-label disease classification and report retrieval, (2) our report retrieval is competitive with dedicated report generation methods, and (3) our method is versatile across datasets, and we discuss its limitations.

2 Related Work

Multi-modal Alignment. The introduction of Transformers for natural language processing (NLP) accelerated the development of integrated vision-language (VL) alignment models suitable for various VL-tasks, such as ViLBERT [19], LXMERT [30] and SimVLM [33]. These methods provide alignment at region-to-sentence or region-to-word scale. The next step in multi-modal alignment was made by methods combining contrastive learning with substantially larger datasets. Examples are CLIP [27] and ALIGN [10], which significantly outperform existing methods by training on datasets of 400M and 1.8B VL-pairs respectively. Since CLIP is open-source, domain-specific versions such as PubMedCLIP [3] have been created by fine-tuning it on additional data.

Retrieval Augmentation. Retrieval augmentation originates in NLP, where it was introduced to exploit large datasets more fully. With retrieval augmentation a model no longer depends solely on its parametric weights but can also draw on supplemented data as a non-parametric component. Previous work has shown retrieval augmentation to be simple, effective and versatile across a range of applications [5, 13, 29].

Retrieval in Medical Imaging. Until recently, retrieval methods in medical imaging were tailored, hand-crafted methods [16]. With access to large datasets and pre-trained models, the balance shifted towards automated retrieval methods [6, 26]. Major strides were made with retrieval methods in the histopathology and radiology domains in particular [2, 8]. The use of text to improve image retrieval has been adopted for chest X-ray retrieval: Yu et al. [35] use CNN and word2vec features for multi-modal alignment and retrieval, while Zhang et al. [36] approach the problem with a hash-based retrieval method.

Retrieval for Chest X-ray Analysis. Common tasks in chest X-ray analysis are disease classification and report generation [1, 11, 15]. Using retrieval for report generation is a common approach, often entailing the use of retrieved information as input or template for a decoder that crafts a custom report [23, 32, 34]. Augmentation of chest X-ray tasks with synthetically generated, diffusion-based images has been shown to be possible [1]; however, the clinical use of non-genuine images can lead to complications and remains disputed [37].

3 Methods

Our method is composed of two separate parts (Fig. 1). The first part is the alignment of the two modalities and construction of the retrieval model. The second part uses the output of the retriever as a non-parametric component in (cross-modal) retrieval augmentation to enhance the downstream tasks.

We consider a dataset \(\varTheta ^{N}_{\{\textbf{x},\textbf{r}\}}\) consisting of pairs containing an X-ray (\(\textbf{x}_i\)) and a radiology report (\(\textbf{r}_i\)). To align these modalities we make use of the powerful CLIP vision-language aligner. Our objective is to minimize the distance between \(\textbf{x}\) and \(\textbf{r}\), making cross-modal tasks possible. The aligned features are then used for retrieval augmentation in the downstream tasks of multi-label classification and report retrieval.

Fig. 1. Architecture overview of X-TRA.

3.1 Stage I: Multi-modal Content Alignment

We leverage the pre-trained features from CLIP for initial feature representations. However, there is a domain shift between the natural image data CLIP is trained on and the medical images we want to use in our method. Medical images can be visually very similar while holding drastically different information: small localized markers can be indicators of disease. In natural images global representations are more decisive and thus more suitable for unsupervised contrastive alignment. Alignment in CLIP proceeds as follows [27]:

$$\begin{aligned} \mathcal {L}_{CLIP}=-\frac{1}{N} \sum _{z \in Z}\sum _{i=1}^N \log \frac{\textrm{e}^{\left( {\text {sim}}\left( z_i^0, z_i^1\right) / \tau \right) }}{\sum _{j=1}^N \textrm{e}^{ \left( {\text {sim}}\left( z_i^0, z_j^1\right) / \tau \right) }}\quad \text {with}\;\; Z = \{(\textbf{x},\textbf{r}),(\textbf{r},\textbf{x})\}. \end{aligned}$$
(1)
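As an illustration, a minimal PyTorch sketch of this symmetric contrastive objective is given below; the temperature value and the tensor names are illustrative assumptions rather than the exact training configuration.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of N aligned (x, r) pairs (Eq. 1)."""
    # L2-normalize so the dot product equals the cosine similarity sim(., .)
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                  # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Z = {(x, r), (r, x)}: cross-entropy in both retrieval directions
    loss_x2r = F.cross_entropy(logits, targets)
    loss_r2x = F.cross_entropy(logits.t(), targets)
    return loss_x2r + loss_r2x
```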

To overcome this domain shift, we require a more specific type of fine-tuning, geared towards content-based extraction. We introduce the following loss, which requires a global class label for each dataset. With this fine-tuning step we create a supervised content-based alignment method with content classifier C:

$$\begin{aligned} \mathcal {L}_{ours} = -\frac{1}{N}\sum _{z \in Z}\sum _{i=1}^N y_i \log \bigl (\widehat{C(z_i)}\bigr )\quad \text {with}\;\; Z = \{\textbf{x},\textbf{r},(\textbf{x},\textbf{r})\}. \end{aligned}$$
(2)

This content-based alignment loss favours the alignment of fine-grained, content-level details over the global visual appearance of the image.
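A minimal sketch of this supervised content alignment is shown below, assuming the shared classifier C is trained with binary cross-entropy over the multi-label targets and that the pair \((\textbf{x},\textbf{r})\) is fused by averaging; both choices are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ContentAlignmentLoss(nn.Module):
    """Sketch of the supervised content alignment objective (Eq. 2): the same
    classifier C scores image features, report features and the fused pair."""
    def __init__(self, classifier: nn.Module):
        super().__init__()
        self.classifier = classifier           # content classifier C
        self.bce = nn.BCEWithLogitsLoss()      # assumed loss for the 14 multi-labels

    def forward(self, x_feat: torch.Tensor, r_feat: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
        fused = 0.5 * (x_feat + r_feat)        # assumed fusion of the (x, r) pair
        loss = x_feat.new_zeros(())
        for z in (x_feat, r_feat, fused):      # Z = {x, r, (x, r)}
            loss = loss + self.bce(self.classifier(z), labels.float())
        return loss
```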

Creating a Retrieval Index. At retrieval time we need to retrieve images that have a high similarity with query images. To efficiently do so we make use of Facebook AI Similarity Search (FAISS) [12]. This retrieval tool efficiently performs nearest-neighbour similarity search. After multi-modal alignment we encode our data to a FAISS index I conditioned on the training dataset. We can construct indices that only retrieve images (\(I^{\textbf{x}}\)), only reports (\(I^{\textbf{r}}\)), or both (\(I^{\textbf{x}\textbf{r}}\)).
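A minimal sketch of index construction with FAISS follows; the index type (a flat, exact inner-product index over L2-normalized embeddings) and the embedding variable names are assumptions.

```python
import faiss
import numpy as np

def build_index(embeddings: np.ndarray) -> faiss.Index:
    """Exact inner-product index over L2-normalized training embeddings."""
    emb = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(emb)                      # cosine similarity via inner product
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    return index

# One index per retrieval source: images only (I^x), reports only (I^r),
# or both modalities together (I^xr). Variable names are placeholders.
index_x = build_index(train_image_embeddings)
index_r = build_index(train_report_embeddings)
index_xr = build_index(np.concatenate([train_image_embeddings,
                                       train_report_embeddings], axis=0))
```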

Given a query \(\mathcal {Q}_s\) in source modality s, we can obtain its k neighbours of target modality t through:

$$\begin{aligned} \mathcal {N}^{k}_{t}(\mathcal {Q}_s) = \mathop {\text {top-}k}\limits _{z_j \in I^{t}}\; {\text {sim}}\left( \mathcal {Q}_s,\, z_j\right) , \end{aligned}$$
(3)

where the target modality \(t\) can be \(\textbf{x}\), \(\textbf{r}\), or both. Once retrieval index I is constructed from the newly aligned training dataset, we treat the retriever as a non-parametric component that retrieves information from a fixed dataset in the subsequent retrieval augmentation steps. Note that at test time, a query from the test set retrieves neighbours from the training set.
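Continuing the index sketch above, a query amounts to a single search call; `index_r` and the embedding variables are the assumed names from that sketch.

```python
import faiss
import numpy as np

k = 10                                           # number of neighbours used in X-TRA
q = np.ascontiguousarray(query_embedding, dtype="float32").reshape(1, -1)
faiss.normalize_L2(q)
scores, idx = index_r.search(q, k)               # image query -> k nearest reports
neighbour_reports = train_report_embeddings[idx[0]]   # features passed on to Stage II
```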

3.2 Stage II: Retrieval Augmentation

The purpose of retrieval augmentation is to leverage similar representations in order to obtain a more informative representation of a given input. With the retrieval index trained in Stage I, we retrieve these similar representations.

To obtain a richer representation of \(\textbf{x}_i\), we retrieve intra- and inter-modal neighbours from \(I^{\textbf{x}}\) and \(I^{\textbf{r}}\) respectively. To integrate the retrieved neighbouring samples \(z_1,\dots ,z_k\), various fusion methods can be used [25]. The simplest is concatenation: \(\textbf{x}_i^{TRA} = [\textbf{x}_i; z_1; \dots ; z_k]\). A more suitable method is multi-head attention (MHA), which is able to capture the long-range dependencies between the original image and the retrieved information [31]:

$$\begin{aligned} \textbf{x}_i^{TRA} = \mathrm {MHA}\left( Q = \textbf{x}_i,\; K = V = [z_1,\dots ,z_k]\right) . \end{aligned}$$
(4)
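A minimal PyTorch sketch of this MHA-based fusion is given below, with the input embedding as query and the retrieved neighbours as keys and values; the number of heads and the wrapper class are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RetrievalFusion(nn.Module):
    """Sketch of MHA fusion (Eq. 4): the input embedding attends over its k
    retrieved neighbours. dim must be divisible by num_heads."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, neighbours: torch.Tensor) -> torch.Tensor:
        # x: (B, dim) query embedding; neighbours: (B, k, dim) retrieved features
        query = x.unsqueeze(1)                           # (B, 1, dim)
        fused, _ = self.mha(query, neighbours, neighbours)
        return fused.squeeze(1)                          # x^TRA, shape (B, dim)
```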

3.3 Downstream Tasks

We tackle two common tasks in chest X-ray analysis: multi-label disease classification and report retrieval. For the latter task, our objective is to show how well a pure retriever can perform on the report generation task. We measure the effect of retrieval augmentation by comparing the task performance of \(\textbf{x}^{TRA}\) against that of \(\textbf{x}\).

A useful property of our retrieval index would be the usability of a pre-trained model across datasets. Three clinically relevant scenarios for this are: training from scratch on the new dataset, frozen usage of the trained retrieval model, and fine-tuning of the existing retrieval model with another image-report dataset.

3.4 Datasets

The primary dataset to which our method is applied is MIMIC-CXR (200k image-report pairs) [11]. Disease labels for each pair are extracted from the report through a rule-based extraction method [9]. To evaluate the versatility and cross-domain capabilities of our method, we use the small openI (4k image-report pairs) [20] and image-only CheXpert (200k images) [9] datasets. Official train-test splits are used.

3.5 Experimental Setup

As a pre-processing step, the X-ray images are normalized and standardized by rescaling with center-cropping to \(256\times 256\), from which images of size \(224\times 224\) are sampled. The maximum number of tokens for representing radiology reports in the text encoder is set to 256. Three different VL models are used as encoders. The first is a CNN-BERT model, composed of a DenseNet121 image encoder and a ClinicalBERT [7] text encoder. Given the strong performance of large vision-language models, we also use CLIP (ViT-32 image encoder and text encoder) [27] and its medically fine-tuned equivalent PubMedCLIP [3], which is fine-tuned on the Radiology Objects in COntext (ROCO) dataset [22].
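For concreteness, the encoders could be instantiated roughly as follows; the checkpoint identifiers (e.g. the ClinicalBERT model name) and loading details are assumptions rather than a verbatim reproduction of our code.

```python
import torch
import torchvision
import clip                                    # OpenAI CLIP package (ViT-B/32)
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP image/text encoders; PubMedCLIP would be loaded from its own checkpoint.
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# CNN-BERT baseline: DenseNet121 image encoder + ClinicalBERT text encoder.
# The Hugging Face identifier below is an assumed, commonly used checkpoint.
densenet = torchvision.models.densenet121(weights="DEFAULT")
bert_tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
bert_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Reports are truncated/padded to at most 256 tokens, as described above.
tokens = bert_tokenizer("FINDINGS: no acute cardiopulmonary process.",
                        truncation=True, max_length=256,
                        padding="max_length", return_tensors="pt")
```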

Multi-modal alignment is implemented as a single pass through a two-layer ReLU-activated MLP with dimension \(z_{enc}\), a dropout rate of 0.5, and layer normalization, where \(z_{enc}\) is the output dimension of the encoder. We implement C as a three-layer classifier head with dimensions \(\{z_{enc},256,14\}\). During retrieval we use \(k=10\) retrieved neighbours. To prevent overfitting, early stopping with a tolerance of 3 is applied to all training runs.
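A minimal sketch of the alignment MLP and classifier head C; the exact placement of dropout and layer normalization within the two layers is an assumption.

```python
import torch.nn as nn

def alignment_mlp(z_enc: int, dropout: float = 0.5) -> nn.Module:
    """Two-layer ReLU MLP with dropout and layer normalization, keeping the
    feature dimension at z_enc."""
    return nn.Sequential(
        nn.Linear(z_enc, z_enc), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(z_enc, z_enc), nn.LayerNorm(z_enc),
    )

def classifier_head(z_enc: int, num_classes: int = 14) -> nn.Module:
    """Content classifier C with dimensions {z_enc, 256, 14}."""
    return nn.Sequential(nn.Linear(z_enc, 256), nn.ReLU(),
                         nn.Linear(256, num_classes))
```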

4 Results

4.1 Cross-Modal Retrieval

We compare the performance of our retrieval method against previous methods in Table 1 in terms of class-based mean average precision (mAP). Due to the powerful alignment of CLIP and our tailor-made fine-tuning, we outperform all existing retrieval approaches for radiology images and/or reports by a large margin. The performance difference with the similarly fine-tuned DenseNet121 and ClinicalBERT encoder combination further underscores the strength of CLIP in building a strong retrieval method, particularly for cross-modal retrieval. Interestingly, we observe that PubMedCLIP does not outperform CLIP. This can be explained by a domain shift between MIMIC-CXR and ROCO, together with the ability of CLIP to generalize well out-of-domain [27]. In our downstream tasks, image-based retrieval is most important; it performs similarly on inter- and intra-modal retrieval tasks.

Table 1. Class-based retrieval performance (source \(\rightarrow \) target) for images (\(\textbf{x}\)) and reports (\(\textbf{r}\)) in terms of mAP on MIMIC-CXR on our content alignment method, compared against other methods.
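Class-based mAP can be computed along the following lines; the relevance criterion (a retrieved item sharing at least one disease label with the query) is an assumed, commonly used definition.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def retrieval_map(query_labels, retrieved_labels, retrieved_scores):
    """mAP over queries: a retrieved item counts as relevant when it shares
    at least one disease label with the query (assumed relevance definition)."""
    aps = []
    for q_lab, r_labs, r_scores in zip(query_labels, retrieved_labels, retrieved_scores):
        relevant = (r_labs @ q_lab > 0).astype(int)    # label overlap per ranked item
        if relevant.sum() == 0:
            continue                                   # AP undefined, skip this query
        aps.append(average_precision_score(relevant, r_scores))
    return float(np.mean(aps))
```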

4.2 Multi-label Disease Classification

Disease classification results in terms of AUC (Table 2) show that retrieval augmentation gives a clear improvement across disease classes. Interestingly, we find a positive, albeit weak, correlation (R\(\approx \)0.60) between the increase in class-wise AUC and retrieval mAP. Moreover, the performance gain from retrieval augmentation (\(0.80\rightarrow 0.85\)) is similar to that of additional training with synthetic diffusion-generated X-rays (\(0.80\rightarrow 0.84\)) [1]. The benefit of our method is that the supplemented information originates from the trusted dataset itself and is not synthetically generated.

Table 2. Chest X-ray classification on MIMIC-CXR with and without retrieval augmentation. The results show the beneficial effect of retrieval augmentation on classification performance.
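Per-class and mean AUC can be computed with scikit-learn as sketched below; the macro averaging over the 14 disease classes is an assumed convention, and `y_true`/`y_prob` are placeholder arrays.

```python
from sklearn.metrics import roc_auc_score

# y_true: (N, 14) binary disease labels, y_prob: (N, 14) predicted probabilities.
per_class_auc = roc_auc_score(y_true, y_prob, average=None)     # AUC per disease
mean_auc = roc_auc_score(y_true, y_prob, average="macro")       # reported mean AUC
```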

4.3 Report Generation

For retrieval-augmented report retrieval, we observe strong performance on standard report generation metrics compared to a selection of previous methods. While simple retrieval should not be expected to outperform dedicated report generation methods, we obtain results that can be considered competitive (Table 3). On the METEOR and ROUGE metrics we even outperform most existing methods. These metrics reflect that the strength of report retrieval lies in the global representation of the report: our retriever is fine-tuned to retrieve samples with equivalent label spaces, hence the good results on metrics that reward global similarity. An interesting outlook is the application of this method within a dedicated report generation framework, which could boost performance further.

Table 3. Chest X-ray report retrieval on MIMIC-CXR with and without X-TRA retrieval augmentation. Compared to dedicated report generation methods.

4.4 Cross-Dataset

Evaluating the cross-dataset scenarios (Table 4) with the CheXpert and openI datasets, we conclude that transferability to images from other domains is limited. However, we do see that when retrieval augmentation is not useful, it can be ignored by the model and is not detrimental to performance. The domain shift between different chest X-ray datasets remains a problem [24]. Currently the most practical solution is the addition of a fine-tuning step.

Cross-domain results on openI show that learning across datasets is possible with fine-tuning. When the openI dataset is added to the existing retrieval index, X-TRA benefits openI in this setting. In the updated retrieval index, 23% of the retrieved information originates from openI and 77% from MIMIC-CXR.

Table 4. Cross-domain result on downstream tasks: Report retrieval (RR) and multi-label classification (MLC) with and without X-TRA.
Fig. 2. Ablation studies on X-TRA on disease classification, for five different random seeds, with (a) different compositions of the retrieval index for \(\mathcal {L}_{CLIP}\) and \(\mathcal {L}_{ours}\) and (b) partial usage of the retrieval index.

4.5 Ablation Studies

We study the effect of the components of our retrieval augmentation method in Fig. 2, specifically the influence of each component in content-based and CLIP-based alignment. Interestingly, the composition of data modalities in retrieval augmentation does not have a large effect, since the retriever yields similar results for inter- and intra-modal retrieval. When randomly selected data is used instead of retrieved information, we achieve results comparable to our method without X-TRA. This is in accordance with the cross-dataset results, showing that when X-TRA-supplemented information is not useful, it can be ignored. Using a partial retrieval index, we conclude that X-TRA can already be useful with a small retrieval index, but performance only reaches optimal levels when \(N>100k\).

4.6 Insight and Limitations

Qualitative results from our retrieval method for two different query images are shown in Fig. 3. We retrieve from both the image index and the report index. The retrieved items match well in terms of the labels attributed to them, showing that our fine-tuning prevents the retrieval of images that are only globally similar.

Fine-tuning the entire CLIP model on domain-specific data is an interesting prospect that could further improve the performance of our retrieval model. However, as shown by the comparison of CLIP against PubMedCLIP in this paper, the accompanying loss of generalization can also be detrimental. This is a promising avenue to explore in future studies.

Fig. 3. Examples of image-image and image-text retrieval including disease class labels. A green outline means a correct retrieval; orange or dashed means a missed or an extra disease label, respectively. (Color figure online)

5 Conclusion

In this work we present X-TRA, a simple yet effective method to improve multiple tasks on radiology images. Our method is composed of a content alignment step and a retrieval augmentation step. With a new label-based alignment loss we leverage pre-trained CLIP features to create a powerful cross-modal retrieval model. The general CLIP model proves more useful for our retrieval model than the slightly out-of-domain, medically fine-tuned PubMedCLIP. We use this retrieval model to improve chest X-ray analysis through retrieval augmentation, adding an enrichment and regularization component that improves both multi-label disease classification and report retrieval, with gains of up to over 5%. On the latter task we are even competitive with dedicated report generation methods. This opens up possibilities for retrieval augmentation as a generic tool in medical imaging.