
1 Introduction

Chest X-ray is the most common medical imaging study worldwide for routine clinical assessment of the chest. Because of its popularity, large labeled datasets such as ChestX-ray14 [24], CheXpert [10], OpenI-IU [5], and MIMIC-CXR [11, 12] have been collected as benchmarks for data-driven deep learning models to achieve expert-level performance in analyzing chest regions. Among these biomedical datasets, OpenI-IU and MIMIC-CXR contain radiology reports along with the corresponding radiographs. Given the large number of collected images and the impracticality of manual labeling, the disease labels are usually derived by applying natural language processing tools to the corresponding radiology reports.

Recently, self-supervised representation learning has been explored to extract underlying information from data by performing proxy tasks that exploit the organization of the data itself. This is a promising direction for learning from large amounts of unlabeled biomedical data, where manual labeling is tedious, time-consuming, subjective, and requires domain knowledge. Self-supervised learning offers great potential for investigating biomedical data, including both medical images and their associated reports, accumulated during clinical routines. Ideally, both modalities of the data encode the same medical condition and should be cross-referable.

The self-attention mechanism was introduced to find cross references within the same data modality [23]. This concept has contributed tremendously to the recent success of natural language processing models such as BERT [6]. These models are pre-trained by predicting masked tokens to learn the underlying semantic representations from unlabeled textual data. Once the representation learning models are pre-trained, they can be fine-tuned and used as a backbone for a wide range of downstream natural language processing tasks.

Motivated by the above discussion, we propose to establish cross references between chest radiology images and reports to jointly learn image-text representations. Learning cross-modal visual and textual representations is an essential task that combines the semantic information contained within images and their descriptive reports [16, 17]. Such approaches have also been explored in biomedical image analysis [18]. The proposed representation learning mechanism provides the foundation for a wide range of biomedical vision-and-language tasks, such as clinical inter-modal and intra-modal image-text retrieval, medical visual question answering [1], and automatic clinical report generation [19].

Contributions: We propose JoImTeRNet, a self-supervised pre-training network trained on multimodal inputs. Our network extracts and fuses the representations of the visual and textual modalities using both global image-sentence matching and local attention-based region-phrase matching, where phrases range in length from one to three words. The proposed local region-phrase alignment enhances joint representation learning by automatically performing fine-grained matching between image regions of interest and phrases in reports. The local region-phrase matching is further strengthened by a soft-attention mechanism in the image encoder, without the need for explicit manual bounding box annotation or object detection on images. The quality of the learned representations is evaluated on downstream classification and retrieval tasks.

2 Joint Image Text Representation Learning Network

We propose a Joint Image Text Representation Learning Network (JoImTeRNet), shown in Fig. 1. The JoImTeRNet architecture consists of an image encoder and a text encoder. The representations are matched through a set of matching tasks: Text to Image Matching (TIM), Masked Language Modeling (MLM), Phrase to Region Alignment (PRA), and Word to Region Alignment (WRA). The learned image and text representations are mapped to a shared feature space, under the hypothesis that radiographs and their corresponding reports contain consistent semantic meaning.

Fig. 1. The architecture of the proposed JoImTeRNet.

Given an X-ray image \(\text {I}\) and its corresponding radiology report \(\text {T}\), we first encode them with an image encoder \(F_I\) and a text encoder \(F_T\). The image encoder contains one input convolution layer, 6 residual blocks [8], and a global average pooling (GAP) layer. We also take the output of the Soft-Attention (SA) [22] block placed after ResBlock5 to extract the region features \(r \in \mathbb {R}^{D \times M}\), such that \(v,r=F_I(\text {I})\), where \(v \in \mathbb {R}^{D}\) is the global image feature from GAP. Sentence- and word-level features s and w are extracted using a Transformer-based [23] text encoder \(F_T\), such that \(w, s = F_{T}(\text {T})\), where \(w \in \mathbb {R}^{D \times N}\) and \(s \in \mathbb {R}^{D}\). Three Transformer layers are deployed in \(F_T\) to encode the text report with the self-attention mechanism.
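The exact layer configuration is only partially specified here; the following is a minimal PyTorch sketch of the two encoders under the stated design (six basic residual blocks, an SA block after ResBlock5, GAP for the global feature, and a 3-layer Transformer text encoder). The `SoftAttention` implementation, channel widths, stem pooling, and mean pooling for the sentence feature are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models.resnet import BasicBlock  # basic residual blocks as in [8]

class SoftAttention(nn.Module):
    """Illustrative soft-attention (SA) block: a 1x1 convolution produces a
    spatial attention map that reweights the ResBlock5 feature map."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                          # x: (B, D, H, W)
        return x * torch.sigmoid(self.score(x))

class ImageEncoder(nn.Module):
    """Input convolution + 6 residual blocks + GAP. Region features r are taken
    from the SA block after ResBlock5; channel widths are assumptions."""
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
        widths = [32, 64, 64, 128, dim, dim]       # ResBlock1..6 output channels
        blocks, in_c = [], 32
        for c in widths:
            down = nn.Sequential(nn.Conv2d(in_c, c, 1, stride=2), nn.BatchNorm2d(c))
            blocks.append(BasicBlock(in_c, c, stride=2, downsample=down))
            in_c = c
        self.blocks = nn.ModuleList(blocks)
        self.soft_att = SoftAttention(dim)         # placed after ResBlock5
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, img):                        # img: (B, 1, 2048, 2048)
        x = self.stem(img)
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i == 4:                             # ResBlock5 output: (B, D, 16, 16)
                r = self.soft_att(x).flatten(2)    # region features r: (B, D, M=256)
        v = self.gap(x).flatten(1)                 # global image feature v: (B, D)
        return v, r

class TextEncoder(nn.Module):
    """3-layer Transformer encoder producing word features w and a sentence
    feature s (mean pooling over words is an assumption; padding masks omitted)."""
    def __init__(self, vocab_size, dim=256, max_len=160):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)

    def forward(self, tokens):                     # tokens: (B, N)
        h = self.encoder(self.embed(tokens) + self.pos[:, :tokens.size(1)])
        s = h.mean(dim=1)                          # sentence feature s: (B, D)
        return h.transpose(1, 2), s                # word features w: (B, D, N)
```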

2.1 Matching Images and Sentences

To learn the joint representation of image-text pairs, we use the cross-entropy based matching (CEM) loss [26] and the ranking-based triplet matching (TM) loss [3]. Given a batch of image-text pairs \({(\text {I}_i,\text {T}_i)}_{i=1}^B\) (B is the batch size) and their corresponding visual features v and sentence features s from \(F_I\) and \(F_T\), the image-to-text CEM loss \(L_{CEM}^{\text {IT}}\) is defined as the negative log posterior probability of the images being matched with their corresponding texts, i.e.,

$$\begin{aligned} L_{CEM}^{\text {IT}} = -\sum _{i=1}^B log(P(\text {T}_i|\text {I}_i)) = -\sum _{i=1}^B log(\frac{e^{\gamma S(\text {I}_i,\text {T}_i)}}{\sum _{j=1}^Be^{\gamma S(\text {I}_i,\text {T}_j)}}) \end{aligned}$$
(1)

where \(\gamma \) is the smoothing factor and \(P(\text {T}_i|\text {I}_i)\) is the posterior probability of \(\text {T}_i\) matching \(\text {I}_i\), computed with softmax. The cosine similarity \(S(\text {I}_i,\text {T}_i)=(v^T s)/(\Vert v\Vert \Vert s\Vert )\) is used as the similarity score between image-text pairs. During training, \(\text {T}_i\) is the correct match to \(\text {I}_i\) in the batch and all other \(\text {T}_{j}(j\ne i)\) are mismatched texts. Since the image-text joint representation mapping should be bidirectional, we swap \(\text {I}\) and \(\text {T}\) in Eq. (1) to obtain the symmetric text-to-image CEM loss \(L_{CEM}^{\text {TI}}\). Thus, the bidirectional CEM loss for globally matching image and text is defined as \(L_{CEM}^s=L_{CEM}^{\text {IT}}+L_{CEM}^{\text {TI}}\).
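As a concrete illustration, a minimal PyTorch sketch of the bidirectional CEM loss under these definitions (matched pairs sit on the diagonal of the in-batch similarity matrix) could look as follows; the function name and batching details are ours, not from the paper.

```python
import torch
import torch.nn.functional as F

def cem_loss(v, s, gamma=2.0):
    """Bidirectional cross-entropy matching (CEM) loss, a sketch of Eq. (1)
    and its symmetric text-to-image counterpart.
    v: (B, D) global image features, s: (B, D) sentence features."""
    sim = gamma * F.normalize(v, dim=-1) @ F.normalize(s, dim=-1).t()  # gamma * S(I_i, T_j)
    targets = torch.arange(v.size(0), device=v.device)                 # matched pairs on the diagonal
    loss_it = F.cross_entropy(sim, targets, reduction="sum")           # image-to-text, Eq. (1)
    loss_ti = F.cross_entropy(sim.t(), targets, reduction="sum")       # text-to-image (symmetric)
    return loss_it + loss_ti                                           # L_CEM^s
```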

Although the CEM loss is designed to make the similarity of correct image-text pairs relatively higher than that of mismatched pairs, it cannot enforce a hard margin between mismatched features. To address this, the TM loss [3], a ranking-based criterion, is added to increase the distance of mismatched pairs in the joint embedding space. Given an image \(\text {I}_i\) as the anchor, \(\text {T}_i\) is used as the positive sample, and we randomly select a mismatched text \(\text {T}_{j} (j\ne i)\) within the batch as the negative sample. Symmetrically, if \(\text {T}_i\) is the anchor, then \(\text {I}_i\) and \(\text {I}_{j}\) are the positive/negative samples. The bidirectional TM loss for global image-text matching is formed as:

$$\begin{aligned} \begin{aligned} L_{TM}^s = L_{TM}^{\text {IT}} + L_{TM}^{\text {TI}}&= \sum _{i,j=1}^{B}\Big [\text {max}(0,S(\text {I}_i,\text {T}_{j})-S(\text {I}_i,\text {T}_i)+\eta _s) \Big .\\&\Big .\quad \quad \quad + \text {max}(0,S(\text {I}_{j},\text {T}_i)-S(\text {I}_i,\text {T}_i)+\eta _s)\Big ] \end{aligned} \end{aligned}$$
(2)

where \(\eta _s\) is the hard margin and S is the same cosine similarity as in Eq. (1).
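A corresponding sketch of the bidirectional TM loss in Eq. (2), with one randomly sampled in-batch negative per anchor, might be implemented as follows (again an illustrative sketch rather than the authors' code):

```python
import torch
import torch.nn.functional as F

def triplet_matching_loss(v, s, eta_s=0.5):
    """Bidirectional triplet matching (TM) loss, a sketch of Eq. (2).
    For each anchor image (text), the paired text (image) is the positive and
    one randomly drawn in-batch mismatch is the negative."""
    B = v.size(0)
    idx = torch.arange(B, device=v.device)
    sim = F.normalize(v, dim=-1) @ F.normalize(s, dim=-1).t()          # (B, B) cosine similarities
    pos = sim.diag()                                                   # S(I_i, T_i)
    neg_idx = (idx + torch.randint(1, B, (B,), device=v.device)) % B   # random j != i
    loss_it = F.relu(sim[idx, neg_idx] - pos + eta_s)                  # image anchor: S(I_i, T_j)
    loss_ti = F.relu(sim[neg_idx, idx] - pos + eta_s)                  # text anchor:  S(I_j, T_i)
    return (loss_it + loss_ti).sum()                                   # L_TM^s
```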

2.2 Aligning Image Regions and Report Phrases

Both chest X-rays and their corresponding reports contain rich fine-grained semantic information. To further improve the joint representation, we introduce region-phrase-level matching to align concepts in the text reports with regions of the images, applying both the CEM loss and the TM loss at this level. The length of a phrase ranges from 1 to 3 words; features of words, bigram phrases, and trigram phrases are denoted as \(w, p_2, p_3\), respectively.

The cosine similarity between regions and words/phrases cannot be computed directly because there is no explicit mapping between them. Instead, an attention-based matching score is deployed to overcome this challenge [7, 9, 26]. For region-word-level matching, given \((\text {I}_i,\text {T}_i)\) and their region-word features (r, w), we first calculate the similarity matrix between all possible pairs of region and word features using the dot product, i.e., \(m=w^Tr\), where \(m\in \mathbb {R}^{N\times M}\), which is further normalized along the N words as \(\bar{m}=\text {Softmax}_N(m)\). Next, a context feature c is computed as the weighted sum over the region features r, weighted by the region-word attention score \(\alpha \), as follows:

$$\begin{aligned} \begin{aligned} c = \alpha r^T \text {, where } \alpha _{i,j} = \frac{e^{\gamma _1 \bar{m}_{i,j}}}{\sum _{k=0}^{M-1} e^{\gamma _1 \bar{m}_{i,k}}} \end{aligned} \end{aligned}$$
(3)

where \(c \in \mathbb {R}^{N\times D}\) and \(\alpha \in \mathbb {R}^{N\times M}\); \(\gamma _1\) is a hyper-parameter to tune the required amount of visual attention for a word. Here, the \(i^{th}\) vector of c is the attention-weighted representation of all the sub-regions related to the \(i^{th}\) word.

The attention-based region-word-level matching score is computed as:

$$\begin{aligned} S_a(\text {I},\text {T}) = \log (\sum _{i=1}^{N-1}e^{(\gamma _{2}S(c_i,w_i))})^{\frac{1}{\gamma _2}} \end{aligned}$$
(4)

where \(S(c_i,w_i)=(c_i^T w_i)/(\Vert c_i\Vert \Vert w_i\Vert )\) is the cosine similarity between \(c_i\) and \(w_i\), and \(\gamma _2\) is a hyper-parameter magnifying the importance of the most relevant word-context pair.
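Putting Eqs. (3) and (4) together, a hedged PyTorch sketch of the attention-based region-word matching score for a single image-report pair could read:

```python
import torch
import torch.nn.functional as F

def region_word_matching_score(r, w, gamma1=1.0, gamma2=1.0):
    """Attention-based region-word matching score S_a(I, T) for one image-report
    pair, a sketch of Eqs. (3)-(4).
    r: (D, M) region features, w: (D, N) word features."""
    m = w.t() @ r                                         # (N, M) similarity matrix
    m_bar = F.softmax(m, dim=0)                           # normalize along the N words
    alpha = F.softmax(gamma1 * m_bar, dim=1)              # attention over M regions, Eq. (3)
    c = alpha @ r.t()                                     # (N, D) word-wise context features
    cos = F.cosine_similarity(c, w.t(), dim=-1)           # S(c_i, w_i) per word
    return torch.logsumexp(gamma2 * cos, dim=0) / gamma2  # Eq. (4)
```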

By replacing the cosine similarity score \(S(\cdot ,\cdot )\) with the region-word matching score \(S_a(\cdot ,\cdot )\) in Eqs. (1) and (2), we obtain the bidirectional CEM and TM losses for region-word alignment as \(L_{CEM}^{p_1}=L_{CEM}^{rw}+L_{CEM}^{wr}\) and \(L_{TM}^{p_1}=L_{TM}^{rw}+L_{TM}^{wr}\).

Furthermore, we obtain phrase features by applying 1D convolutional layers with kernel sizes 2 and 3 over w, yielding bigram features \(p_2 = \theta _{p_2}^Tw\) and trigram features \(p_3 = \theta _{p_3}^Tw\), respectively [14, 27], where \(\theta _{p_2}, \theta _{p_3}\) are the convolution kernels of size 2 and 3. The final cross-entropy with triplet matching (CETM) loss for image-text joint representation learning is designed as:

$$\begin{aligned} L_{CETM} = \lambda _{CEM}( L_{CEM}^s + \sum _{i=1}^{3}L_{CEM}^{p_{i}}) + \lambda _{TM}( L_{TM}^s + \sum _{i=1}^{3}L_{TM}^{p_{i}}) \end{aligned}$$
(5)

where \(\lambda _{CEM}\) and \(\lambda _{TM}\) are the loss weight hyper-parameters.
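A short sketch of the phrase-feature extraction (1D convolutions of kernel sizes 2 and 3 over the word features) and of how the losses combine into Eq. (5) is given below; the module name and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PhraseFeatures(nn.Module):
    """Bigram/trigram phrase features from word features via 1D convolutions
    (kernel sizes 2 and 3), a sketch of the phrase extraction described above."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv_p2 = nn.Conv1d(dim, dim, kernel_size=2)   # theta_p2
        self.conv_p3 = nn.Conv1d(dim, dim, kernel_size=3)   # theta_p3

    def forward(self, w):            # w: (B, D, N) word features
        p2 = self.conv_p2(w)         # (B, D, N-1) bigram features
        p3 = self.conv_p3(w)         # (B, D, N-2) trigram features
        return p2, p3

# The full CETM objective of Eq. (5) then sums the CEM and TM losses computed
# at the sentence level and at the word/bigram/trigram levels:
#   L_CETM = lambda_CEM * (L_CEM_s + L_CEM_p1 + L_CEM_p2 + L_CEM_p3)
#          + lambda_TM  * (L_TM_s  + L_TM_p1  + L_TM_p2  + L_TM_p3)
```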

2.3 Downstream Task

To demonstrate the performance of joint representation learning, we use the pre-trained image and text encoders as the backbone and test the learned features on multi-label classification. We add a projection layer followed by two fully connected layers for multi-label classification. A cross-entropy loss balanced with the positive/negative ratio and class-wise weights [25] is used for training.
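A possible sketch of such a classification head, with a projection layer followed by two fully connected layers and a class-balanced binary cross-entropy criterion, is shown below; the layer widths, the 14-label output, and the use of `BCEWithLogitsLoss` with `pos_weight` (placeholder values) are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Downstream multi-label classifier on top of pre-trained encoder features:
    a projection layer followed by two fully connected layers (widths assumed)."""
    def __init__(self, feat_dim=256, hidden=256, num_labels=14):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),   # projection
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),     # FC 1
            nn.Linear(hidden, num_labels))                        # FC 2 -> logits

    def forward(self, feats):
        return self.classifier(feats)

# Class-balanced binary cross-entropy; pos_weight would encode the per-label
# positive/negative ratio as in [25] (placeholder values shown here).
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.ones(14))
```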

3 Experiments

3.1 Datasets

MIMIC-CXR v2.0 [11] is a large public dataset consisting of 377,110 chest X-rays associated with 227,835 radiology reports. We limit our study to frontal-view images and keep only one frontal-view image per report. Following the pre-processing scheme in [3], we extract the impression, findings, conclusion, and recommendation sections from the raw report, normalize them with SciSpaCy [20], and concatenate them. If none of these sections is present, we use the final report section. The 14 CheXpert labels provided in MIMIC-CXR are used for the classification task, where label 1 is considered positive and all other labels (−1, 0) and missing labels are merged as negative. This results in 222,252 image-report pairs with 14 binary labels. We split the dataset into 217,252, 2,000, and 3,000 samples for training, validation, and testing, respectively.

OpenI-IU [5] is a public dataset with 3,996 radiology reports and 8,121 associated chest X-ray images, which are manually annotated by human experts using MeSH words. Similar to TieNet [25], only unique frontal images and their corresponding reports that contain findings and/or impressions are selected. This yields 3,643 image-report pairs, which are used only as an external evaluation set. For comparison and evaluation purposes, we select the 7 labels from the MeSH domain that are common to both OpenI and MIMIC-CXR (Table 3).

Table 1. Ablation study for selecting the best loss setting. The matching scores for OpenI-IU and MIMIC-CXR are computed on 1,000 and 1,000/3,000 test samples, respectively. Subscripts s, w, and p denote image-text, region-word, and region-phrase level matching, respectively.

3.2 Implementation Details

JoImTeRNet is implemented in PyTorch [21] and all experiments are carried out on NVIDIA GTX 1080 Ti GPUs. For \(F_I\), we use the basic residual blocks proposed in [8]. We employ 3 Transformer layers with 8 heads in \(F_T\). The input image is encoded into 256 region features (r), flattened from the \(16 \times 16\) feature map output of ResBlock5, as shown in Fig. 1. The input image is cropped or padded to \(2048 \times 2048\) and then normalized to [−1, 1]. Random crop, rotation, and color jitter are used for data augmentation. The report input is tokenized with a word-level tokenization scheme, where we collect all words that appear more than twice in the MIMIC-CXR dataset, resulting in a vocabulary size of 8,410. The input reports are truncated or padded to a maximum length of \(N=160\).

Parameter Settings. We pre-train \(F_I\) and \(F_T\) on the MIMIC-CXR training set using the image-text matching tasks explained in Sects. 2.1 and 2.2 to generate the joint image and text representations. The maximum number of epochs is set to 30. We employ the AdamW [15] optimizer with an initial learning rate of \(10^{-4}\), which is dropped by a factor of 10 after 20 epochs. L2 weight decay is set to \(10^{-4}\). For the downstream classification task in Sect. 2.3, we set up two different settings for comparison: randomly initializing the backbone and fine-tuning the pre-trained backbone. The learning rate for the classification head in both settings is set to \(10^{-4}\). For the randomly initialized setting, we train the backbone with the same learning rate as the classification head for 20 epochs, whereas the pre-trained backbone is fine-tuned with a smaller learning rate of \(10^{-5}\) for only 10 epochs. Our model pre-trained with the full loss setting CETMwps is used as the backbone for fine-tuning. The batch size is set to 32 for all experiments. We set the loss hyper-parameters to \(\gamma ,\gamma _1,\gamma _2=2,1,1\), \(\eta _s, \eta _w=0.5,0.5\), and \(\lambda _{CEM},\lambda _{TM}=2.0, 1.0\).
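For reference, the pre-training optimizer and schedule described here translate roughly into the following PyTorch configuration; the placeholder module merely stands in for the joint parameters of \(F_I\) and \(F_T\), and the per-epoch loop body is elided.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(256, 256)                                  # placeholder for F_I and F_T parameters
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = StepLR(optimizer, step_size=20, gamma=0.1)       # learning rate dropped 10x after 20 epochs

for epoch in range(30):                                      # maximum of 30 epochs
    # ... one epoch over the MIMIC-CXR training set (batch size 32),
    #     minimizing the CETM loss of Eq. (5) ...
    scheduler.step()
```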

Table 2. Classification AUCs on the MIMIC-CXR [11] dataset. "FS" stands for training from scratch; "FT" stands for the fine-tuned model. Comparison methods are VisualBERT [17], UNITER [4], and ClinicalBERT [2].

3.3 Performances

Evaluation Metric. We evaluate the performance of JoImTeRNet on the cross-modality retrieval task: given one image (text) as a query, we rank a subset of texts (images), including the paired one, based on the cosine similarity between the image and text features from JoImTeRNet. We report Recall@K (R@K) [13] with \(K \in \{1,5,10\}\), which measures the fraction of queries for which the correct match is retrieved among the top K results in the test set. We compute R@K on a subset of 1,000 image-text pairs and on the full 3,000 samples in our MIMIC-CXR test set. We also report R@K on a subset of 1,000 samples from OpenI-IU to evaluate JoImTeRNet on an external dataset (Table 1).
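A small sketch of how R@K can be computed from the encoded features (image-to-text direction shown; text-to-image follows by swapping the arguments); the function name is ours.

```python
import torch
import torch.nn.functional as F

def recall_at_k(img_feats, txt_feats, ks=(1, 5, 10)):
    """Image-to-text Recall@K: for each query image, rank all candidate texts
    by cosine similarity and check whether the paired text (same row index)
    appears in the top K."""
    sim = F.normalize(img_feats, dim=-1) @ F.normalize(txt_feats, dim=-1).t()
    ranked = sim.argsort(dim=1, descending=True)                 # (Q, Q) candidate indices per query
    target = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    rank_of_match = (ranked == target).float().argmax(dim=1)     # rank of the true pair
    return {k: (rank_of_match < k).float().mean().item() for k in ks}
```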

Ablation Study for Loss Settings. Ablation studies for different combinations of our losses are listed in Table 1. The full loss setting CETMwps achieves the highest R@5 and R@10 scores on all test sets, which shows the effectiveness of our multilevel phrase matching loss. In addition, matching performance degrades when the model is trained on the global matching loss only, without region-phrase(word)-level matching, i.e., CEMs performs worse than CEMws; similar results are found when comparing TMs with TMws. This shows that our proposed region-word matching improves the representation ability of the image-text encoder. Moreover, the CETM combination consistently outperforms the CEM-only and TM-only loss settings, as expected from Sects. 2.1 and 2.2. Note that the matching scores are much lower on OpenI, since OpenI contains a large number of near-identical reports, e.g., 'No acute disease.', which yield very similar feature representations from our model and thus largely degrade the matching score.

Table 3. Classification AUCs on the OpenI-IU [5] dataset. "FS" stands for training from scratch; "FT" stands for the fine-tuned model. Comparison methods are ChestX-ray14 [24], TieNet [25], VisualBERT [17], UNITER [4], and ClinicalBERT [2].

Downstream Image Classification Results. The AUCs of our two settings on both datasets, along with other state-of-the-art (SOTA) results, are shown in Tables 2 and 3. The classifier fine-tuned on the JoImTeRNet backbone (FT) consistently outperforms training from scratch (FS), which demonstrates the benefit of our pre-training method. As shown in Table 2, FT achieves the highest AUCs on most labels of the MIMIC-CXR test set (internal evaluation), even surpassing some SOTA models [2, 4, 17] on image-text and text classification. For the external evaluation on OpenI in Table 3, our FT setting improves the average AUC on image classification by 18% compared with TieNet [25] and gains 1% in weighted-average AUC over ClinicalBERT [2] on report classification. For image-text classification, our model remains comparable with other SOTA models, even though our text encoder contains only 3 Transformer layers, whereas [4, 17] use a 12-layer BERT encoder as the backbone.

4 Conclusion

We propose a joint image-text representation learning network and show its performance on cross-modality retrieval and multi-label classification. We demonstrate the potential of self-supervised learning applied to the continuously generated biomedical images and reports. We also leverage, and show the importance of, the information contained in the relationships among words, phrases, and image regions. Future work includes more complex downstream tasks involving both images and text.