
1 Introduction

In the medical domain, each imaging modality reflects particular physical properties of the tissue under examination. This results in images with different dimensionality, spatial resolution, and contrast. Different imaging modalities thus provide complementary streams of information for clinical diagnostics and for technical pre- and post-processing steps. Moreover, the acquisition of medical images is susceptible to various kinds of noise and modality-specific artefacts. To remedy these issues, translating images between different domains is of great importance.

Inter-modality image-to-image translation can potentially replace additional acquisition procedures, reducing examination cost and time. In addition, intra-modality image-to-image translation enables complex artefact and noise correction. For example, attenuation correction of positron emission tomography (PET) data is challenging when no density distribution is available from computed tomography (CT) data, as is the case for stand-alone PET scanners or combined PET/magnetic resonance imaging (MRI). In these situations, the generation of pseudo-CTs from PET data can be helpful. Further examples are related to image reconstruction and correction in MRI: reconstructing undisturbed, artefact-free images is hard to achieve with traditional methods, whereas deep-learning-based image-to-image translation can address this challenge. In particular, generative adversarial networks (GANs) based on convolutional neural networks (CNNs) have proven to provide high visual quality of the generated synthetic images. However, the predictions of GANs can be unreliable, and particularly in medical applications, the quantification of uncertainty is of high importance for the interpretation of the results. In this work, we propose a generic end-to-end model that introduces high-capacity conditional progressive GANs to synthesize high-quality images, using aleatoric uncertainty estimates as a guide to focus on improving image quality in regions where the network is highly uncertain about its prediction. We perform experiments on three challenging and vital medical imaging tasks: PET to CT translation, undersampled MRI reconstruction, and motion correction in MRI. Moreover, we empirically demonstrate the efficacy of our model under weak supervision with limited data.

2 Related Works

Traditional machine learning techniques for medical image translation rely on explicit feature representations [6, 12, 17, 35]. More recently, convolutional neural networks have been proposed for various image translation tasks [3,4,5, 8, 13, 19], and state-of-the-art performance is achieved by generative adversarial networks [1, 2, 7, 9, 15, 21, 22, 29,30,31,32]. The existing methods propose conditional GAN architectures with deterministic outputs, typically using an \(\mathcal {L}_1/\mathcal {L}_2\)-based fidelity loss for the generator that assumes pixel-wise homoscedasticity and assumes the pixel-wise errors (i.e., residuals) to be independent and identically distributed (i.i.d.), following a Laplace or Gaussian distribution. This is a limiting assumption, as explained in [10, 23, 25]. While these methods can provide synthetic images of high visual quality, the image content may still deviate significantly from the corresponding ground-truth. This can result in overconfidence or misinterpretation, with negative consequences particularly in the medical domain. While there have been recent works on quantifying aleatoric and epistemic uncertainty in task-specific medical imaging algorithms such as classification, segmentation, and super-resolution [14, 20, 25,26,27], quantifying it for the general image-to-image translation problem remains largely unexplored. Thus, the central motivation of our work is to provide measures of uncertainty for image-to-image translation tasks that can contribute to the safe application of their results.

Moreover, recent work has shown that high-capacity generators of a progressive nature lead to high-quality results, as described in [1, 2, 9]. However, the progressive generation of high-quality images remains unguided, without specific attention to poorly translated regions. Prior works indicate a correlation between estimated uncertainty and prediction error [20, 23, 33]. We exploit this relationship for the progressive enhancement of synthetic images, which has not been investigated in prior work.

3 Uncertainty-Guided Progressive GAN (UP-GAN)

Let A and B be two image domains with sets of images \(S_{A} := \{a_1, a_2 \dots a_n\}\) and \(S_{B} := \{b_1, b_2 \dots b_m\}\), where \(a_i\) and \(b_i\) represent the \(i^{th}\) image from domain A and B, respectively. Let each image, drawn from an underlying unknown probability distribution \(\mathcal {P}_{AB}\), i.e., \((a_i, b_i) \sim \mathcal {P}_{AB}\ \forall i\), have K pixels, and let \(u_{ij}\) represent the \(j^{th}\) pixel of a particular image \(u_i\). Our goal is to learn a mapping from domain A to B (\(A \rightarrow B\)) in a paired manner, i.e., learning the underlying conditional distribution \(\mathcal {P}_{B|A}\) from the set of given samples \(\{(a_i, b_i)\}\) following the distribution \(\mathcal {P}_{AB}\). For a given image \(a_i\) in domain A, the estimated image in domain B is denoted \(\hat{b}_i\). The pixel-wise error is defined as \(\epsilon _{ij} = \hat{b}_{ij} - b_{ij}\). While existing frameworks model the residuals as i.i.d. as described above, we relax that assumption by modelling the residuals as non-i.i.d. variables and learning the optimal distribution from the dataset, as described in the following.

Fig. 1. Uncertainty-guided Progressive GANs (UP-GAN): the primary GAN takes the input image from domain A, while each subsequent GAN absorbs the outputs of the preceding GAN (see Eqs. 3 and 4), explicitly guided by attention maps derived from the uncertainty maps estimated by the preceding GAN.

Figure 1 shows our model, which consists of cascaded GANs, where each generator is capable of estimating the aleatoric uncertainty along with generating images. Our solution alleviates the aforementioned limitations of recent methods by modelling the underlying per-pixel residual distribution as an independent but non-identically distributed zero-mean generalized Gaussian distribution (GGD), as in [23], where the network learns to predict the optimal scale (\(\alpha \)) and shape (\(\beta \)) of the GGD for every pixel. Therefore, \(\hat{b}_{ij} = b_{ij} + \epsilon _{ij}\) with \( \epsilon _{ij} \sim GGD(\epsilon ; 0, \alpha _{ij}, \beta _{ij}) \equiv \frac{\beta _{ij}}{2\alpha _{ij}\varGamma (\beta ^{-1}_{ij})} \exp \left( -\left( |\epsilon |/\alpha _{ij}\right) ^{\beta _{ij}}\right) \). We generate images in multiple phases, with each phase generating output images along with aleatoric uncertainty estimates. The outputs of one phase serve as the input to the GAN of the next phase, explicitly guided by the attention map derived from the uncertainty estimates. Importantly, this uncertainty-based guidance enforces the model to focus on refining the uncertain regions, which are likely to be poorly synthesized, resulting in progressively improving quality.

Our framework is composed of a sequence of M GANs, where the \(m^{th}\) GAN is represented by a pair of networks, a generator and a discriminator, given by \((\mathcal {G}_m(\cdot ; \theta _m), \mathcal {D}_m(\cdot ; \phi _m))\). Both the generator and the discriminator can have arbitrary network architectures, as long as the generator can estimate aleatoric uncertainty as described in [23]. We choose all discriminators to be patch discriminators from [7] and all generators to be modified U-Nets [16], where the head is split into three to estimate the parameters of the GGD, as shown in Fig. 1 and in [23].
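To make the split head concrete, the following is a minimal PyTorch sketch of such a three-headed generator output. The backbone module, the channel width `feat_ch`, the `eps` floor, and the `tanh` output range are illustrative assumptions on our part, not specifications from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeHeadGenerator(nn.Module):
    """U-Net-style generator whose head is split into three branches that
    predict the translated image and the per-pixel GGD scale/shape maps."""
    def __init__(self, backbone: nn.Module, feat_ch: int = 64, eps: float = 1e-3):
        super().__init__()
        self.backbone = backbone              # any encoder-decoder producing feat_ch feature maps
        self.eps = eps                        # keeps alpha and beta strictly positive
        self.head_img = nn.Conv2d(feat_ch, 1, kernel_size=1)
        self.head_alpha = nn.Conv2d(feat_ch, 1, kernel_size=1)
        self.head_beta = nn.Conv2d(feat_ch, 1, kernel_size=1)

    def forward(self, x):
        feat = self.backbone(x)                               # B x feat_ch x H x W
        b_hat = torch.tanh(self.head_img(feat))               # synthesized image
        alpha = F.softplus(self.head_alpha(feat)) + self.eps  # GGD scale > 0
        beta = F.softplus(self.head_beta(feat)) + self.eps    # GGD shape > 0
        return b_hat, alpha, beta
```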

Primary GAN. We train the first GAN (\(\mathcal {G}_0\)) using the datasets \(S_A\) and \(S_B\). The predictions of the generator are given by \((\hat{\alpha }_{[0]i},\hat{\beta }_{[0]i},\hat{b}_{[0]i})\). The network is trained with an adaptive fidelity loss function \(\mathcal {L}^G_{\alpha \beta }\) [23] and an adversarial loss \(\mathcal {L}^G_{\text {adv}}\) [36], combined as \(\mathcal {L}^G_{\text {tot}}\) for the generator (\(\mathcal {G}_0(\cdot ; \theta _0): A \rightarrow B\)):

$$\begin{aligned}&\mathcal {L}^G_{\alpha \beta }(\hat{b}_{[0]i}, \hat{\alpha }_{[0]i}, \hat{\beta }_{[0]i}, b_{i}) = \frac{1}{K}\boldsymbol{\sum }_{j} \left[ \left( \frac{|\hat{b}_{[0]ij}-b_{ij}|}{\hat{\alpha }_{[0]ij}} \right) ^{\hat{\beta }_{[0]ij}} - \log \frac{\hat{\beta }_{[0]ij}}{\hat{\alpha }_{[0]ij}} + \log \varGamma (\hat{\beta }_{[0]ij}^{-1}) \right] \end{aligned}$$
(1)
$$\begin{aligned}& \mathcal {L}_{\text {adv}}^G = \mathcal {L}_2(\mathcal {D}_0(\hat{b}_{[0]i}),1) \text { and } \mathcal {L}^G_{\text {tot}} = \lambda _1 \mathcal {L}^G_{\alpha \beta } + \lambda _2 \mathcal {L}_{\text {adv}}^G . \end{aligned}$$
(2)

The patch discriminator (\(\mathcal {D}_0\)) is trained using the adversarial loss from [36], given by \(\mathcal {L}^D_{\text {adv}} = \mathcal {L}_2(\mathcal {D}_0(b_i),1) + \mathcal {L}_2(\mathcal {D}_0(\hat{b}_{[0]i}),0)\).
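As an illustration, the PyTorch sketch below is one way to implement the fidelity loss of Eq. (1) and the least-squares adversarial terms of Eq. (2); the reduction over the batch via `.mean()` and the function names are our own assumptions.

```python
import torch

def ggd_fidelity_loss(b_hat, alpha, beta, b):
    """Negative log-likelihood of the zero-mean GGD residual model (Eq. 1),
    averaged over pixels (and over the batch)."""
    residual = torch.abs(b_hat - b)
    nll = (residual / alpha) ** beta - torch.log(beta / alpha) + torch.lgamma(1.0 / beta)
    return nll.mean()

def adv_loss_generator(d_fake):
    """Least-squares adversarial loss for the generator: L2(D(b_hat), 1)."""
    return torch.mean((d_fake - 1.0) ** 2)

def adv_loss_discriminator(d_real, d_fake):
    """Least-squares adversarial loss for the patch discriminator:
    L2(D(b), 1) + L2(D(b_hat), 0)."""
    return torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)

# Total generator objective (Eq. 2), with (lambda1, lambda2) = (1, 0.001) as in Sect. 4.1:
# loss_G = 1.0 * ggd_fidelity_loss(b_hat, alpha, beta, b) + 0.001 * adv_loss_generator(d_fake)
```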

Subsequent GANs. The \(m^{th}\) GAN (where \(m > 0\)) takes the output produced by the \((m-1)^{th}\) GAN, i.e., \((\hat{\alpha }_{[m-1]i},\hat{\beta }_{[m-1]i},\hat{b}_{[m-1]i})\), along with the original sample \(a_i\) from domain A as its input and generates a refined output. The image estimated by the \((m-1)^{th}\) GAN is combined with its uncertainty map to create the input feature \(f_{[m]i}\) for the \(m^{th}\) GAN, where the uncertainty map serves as an attention mechanism highlighting the uncertain regions in the image. The input \(a_{[m]i}\) for the \(m^{th}\) generator is given by concatenating \(a_i\) and \(f_{[m]i}\), i.e.,

$$\begin{aligned}&\hat{\sigma }_{[m-1]i} = \hat{\alpha }_{[m-1]i}\sqrt{ \frac{\varGamma (3/\hat{\beta }_{[m-1]i})}{\varGamma (1/\hat{\beta }_{[m-1]i})}} \text {, and } f_{[m]i} = \hat{b}_{[m-1]i} \odot \frac{\hat{\sigma }_{[m-1]i}}{\boldsymbol{\sum }_j \hat{\sigma }_{[m-1]ij}} \end{aligned}$$
(3)
$$\begin{aligned}& a_{[m]i} = \mathtt {concat}(f_{[m]i}, a_i) \end{aligned}$$
(4)

The input \(a_{[m]i}\) for the \(m^{th}\) GAN encourages the generator to further refine the highly uncertain regions in the image given the original input context. The generator and the discriminator are trained using \(\mathcal {L}^G_{\text {tot}}\) and \(\mathcal {L}^D_{\text {adv}}\), respectively.
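For illustration, a minimal PyTorch sketch of Eqs. (3) and (4): the GGD standard deviation is computed from the predicted scale and shape maps, normalized into an attention map, and concatenated with the domain-A input. Normalizing per image over the spatial dimensions is our reading of the sum in Eq. (3) and should be treated as an assumption.

```python
import torch

def next_phase_input(b_prev, alpha_prev, beta_prev, a):
    """Build the input a_[m] of the m-th GAN from the (m-1)-th outputs (Eqs. 3 and 4)."""
    # GGD standard deviation: sigma = alpha * sqrt(Gamma(3/beta) / Gamma(1/beta))
    sigma = alpha_prev * torch.exp(
        0.5 * (torch.lgamma(3.0 / beta_prev) - torch.lgamma(1.0 / beta_prev))
    )
    # Normalize the uncertainty map per image so that it acts as an attention map.
    attn = sigma / sigma.sum(dim=(-2, -1), keepdim=True)
    f = b_prev * attn                  # uncertainty-weighted image feature f_[m]
    return torch.cat([f, a], dim=1)    # a_[m] = concat(f_[m], a)
```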

Progressive Training Scheme. We initialize the parameters \(\theta \cup \phi \) sequentially. First, we initialize \(\theta _0 \cup \phi _0\) using the training set \((S_A, S_B)\) by minimizing the loss functions \(\mathcal {L}^G_{\text {tot}}\) and \(\mathcal {L}^D_{\text {adv}}\). Then, for the subsequent GANs, we initialize \(\theta _m \cup \phi _m\) (\(m>0\)) by fixing the weights of all previous generators and training the \(m^{th}\) GAN alone (see Eqs. 3 and 4, with losses \(\mathcal {L}^G_{\text {tot}}\) and \(\mathcal {L}^D_{\text {adv}}\)). Once all parameters have been initialized (i.e., \(\theta _m \cup \phi _m\ \forall m\)), we fine-tune further by training all the networks end-to-end, combining the loss functions of all intermediate phases and using a significantly smaller learning rate.
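The scheme can be summarized in code as in the following hedged PyTorch sketch, where `gans` is a list of (generator, discriminator) pairs and `train_phase` is a user-supplied routine that performs the phase-wise optimization with \(\mathcal {L}^G_{\text {tot}}\) and \(\mathcal {L}^D_{\text {adv}}\); both names are placeholders, not interfaces defined in the paper.

```python
import torch

def progressive_initialize_then_finetune(gans, train_phase, finetune_lr=5e-4):
    """Sequential initialization of the M GANs followed by end-to-end fine-tuning.

    gans        -- list of (generator, discriminator) module pairs
    train_phase -- callable train_phase(m, frozen_generators) that trains the m-th
                   GAN while the frozen generators only provide its inputs (Eqs. 3, 4)
    """
    # Phase-wise initialization: train the m-th GAN with earlier generators fixed.
    for m in range(len(gans)):
        frozen = [g for g, _ in gans[:m]]
        for g in frozen:
            g.requires_grad_(False)
        train_phase(m, frozen)

    # End-to-end fine-tuning: unfreeze everything, combine the losses of all phases,
    # and use a significantly smaller learning rate than during initialization.
    params = []
    for g, d in gans:
        g.requires_grad_(True)
        d.requires_grad_(True)
        params += list(g.parameters()) + list(d.parameters())
    return torch.optim.Adam(params, lr=finetune_lr, betas=(0.9, 0.999))
```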

4 Experiments

In this section, we first detail the experimental setup and comparative methods in Sect. 4.1, and present the corresponding results in Sect. 4.2.

4.1 Experimental Setup

Tasks and Datasets. We evaluate our method on the following three tasks.

  (i) PET to CT translation: We synthesize CT images from PET scans to be used for attenuation correction, e.g. for PET-only scanners or PET/MRI. We use paired datasets of non-attenuation-corrected PET and the corresponding CT of the head region of 49 patients, acquired on a state-of-the-art PET/CT scanner (Siemens Biograph mCT) and approved by the ethics committee of the Medical Faculty of the University of Tübingen. The data is split into 29/5/15 patients for the training/validation/test sets. Figure 2 shows exemplary slices of co-registered PET and CT.

  (ii) Undersampled MRI reconstruction: We translate undersampled MRI images to fully-sampled MRI images. We use MRI scans from the publicly available IXI dataset, which consists of T1-weighted (T1w) MRI scans. We use a cohort of 500 patients, split into 200/100/200 for training/validation/test, and retrospectively create the undersampled MRI with an acceleration factor of \(12.5\times \), i.e., we preserve only \(8\%\) of the fully-sampled k-space measurements (from the central region) to obtain the undersampled image (a sketch of this retrospective undersampling is given after this list).

  (iii) MRI motion correction: We generate sharp images from motion-corrupted images. We retrospectively create the motion artefacts in the T1w MRI scans from IXI by applying transformations in k-space, as described in [18]. Figure 3-(ii) shows an input MRI scan with artefacts and the ground-truth.
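As referenced in item (ii), one plausible way to create the retrospectively undersampled inputs is the NumPy sketch below, which keeps only a central band of k-space rows and zero-fills the rest. The exact sampling mask used in our experiments (full central lines vs. a central square region) may differ, so the masking strategy here is an assumption for illustration only.

```python
import numpy as np

def center_undersample(image, keep_fraction=0.08):
    """Retrospectively undersample a 2D slice by keeping only the central
    `keep_fraction` of k-space rows (low frequencies) and zero-filling the rest."""
    k = np.fft.fftshift(np.fft.fft2(image))        # centered k-space
    n_rows = k.shape[0]
    keep = max(1, int(round(n_rows * keep_fraction)))
    lo = (n_rows - keep) // 2
    mask = np.zeros_like(k, dtype=bool)
    mask[lo:lo + keep, :] = True                   # central low-frequency band
    undersampled_k = np.where(mask, k, 0)
    return np.abs(np.fft.ifft2(np.fft.ifftshift(undersampled_k)))
```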

Training Details and Evaluation Metrics. All GANs are first initialized using the aforementioned progressive learning scheme, with \((\lambda _1, \lambda _2)\) in Eq. 2 set to (1, 0.001). We use Adam [11] with hyper-parameters \(\beta _1 := 0.9\), \(\beta _2 := 0.999\), an initial learning rate of 0.002 for initialization and 0.0005 post-initialization that decays based on cosine annealing over 1000 epochs, and a batch size of 8. We use three widely adopted metrics to evaluate image generation quality. PSNR measures \(20 \log _{10}\left( {\text {MAX}_I}/{\sqrt{\text {MSE}}}\right) \), where \(\text {MAX}_I\) is the highest possible intensity value in the image and \(\text {MSE}\) is the mean-squared error between two images. SSIM computes the structural similarity between two images [28]. MAE computes the mean absolute error between two images. Higher PSNR and SSIM and lower MAE indicate higher quality of the generated images (with respect to the ground-truth).
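For completeness, a small NumPy sketch of the PSNR and MAE computations follows (SSIM can be computed with `skimage.metrics.structural_similarity`); defaulting \(\text {MAX}_I\) to the ground-truth maximum when it is not passed explicitly is an assumption.

```python
import numpy as np

def psnr(pred, gt, max_i=None):
    """PSNR = 20 * log10(MAX_I / sqrt(MSE))."""
    max_i = float(gt.max()) if max_i is None else max_i
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 20.0 * np.log10(max_i / np.sqrt(mse))

def mae(pred, gt):
    """Mean absolute error between two images."""
    return float(np.mean(np.abs(pred.astype(np.float64) - gt.astype(np.float64))))
```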

Compared Methods. We compare our model to representative state-of-the-art methods for medical image translation: Pix2pix [7], a baseline conditional adversarial network for image-to-image translation tasks; PAN [24]; and MedGAN [2], a GAN-based method that relies on external pre-trained feature extractors, with a generator that refines the generated images progressively. MedGAN has been shown to outperform methods such as Fila-sGAN [34] and ID-cGAN [32] and to achieve state-of-the-art performance on several medical image-to-image translation problems.

Fig. 2. Outputs from different phases of UP-GAN (with M = 3). (Top) The input (uncorrected PET), the corresponding ground-truth CT, mean residual values over different phases, and mean uncertainty values over different phases. (Bottom) Each row shows the predicted output, the residual between the prediction and the ground-truth, the predicted scale (\(\alpha \)) map, the predicted shape (\(\beta \)) map, the uncertainty map, and the uncertainty in high-residual regions.

Fig. 3. Qualitative results. (Top) PET to CT translation. (Bottom) Undersampled MRI reconstruction (left) and MRI motion correction (right).

4.2 Results and Analysis

Qualitative Results. Figure 2 visualizes the (intermediate) outputs of the generators at the different phases of the framework. The visual quality of the generated image content improves progressively across the network phases (as shown in the first column, second row onward). At the same time, the prediction error and the uncertainty decrease continuously (second and fifth column, second row onward, respectively). High uncertainty values are found in anatomical regions with fine osseous structures, such as the nasal cavity and the inner ear in the petrous portion of the temporal bone. Particularly in such regions of high uncertainty, we achieve a progressive improvement in the level of detail.

Figure 3-(Top) visualizes the CT images generated from PET for all compared methods along with ours. We observe that more high-frequency features are present in our prediction than in that of the previous state-of-the-art model (MedGAN). We also observe that the overall residual is significantly lower for our method than for the other baselines. MedGAN performs better than pix2pix in synthesizing high-frequency features and sharper images. Figure 3-(Bottom) shows similar results for the undersampled MRI reconstruction and MRI motion correction tasks. In both cases, our model yields superior images, as can be seen from the comparatively neutral residual maps.

Quantitative Results. Table 1 shows the quantitative performance of all methods on the three tasks; on all tasks, our method outperforms the recent models. In particular, for the most challenging task, PET to CT translation, our method with uncertainty-based guidance outperforms the previous state-of-the-art method, MedGAN (which relies on a task-specific external feature extractor), without using any external feature extractor. The uncertainty guidance thus removes the need for an externally trained, task-specific feature extractor to achieve high-fidelity images. The same trend holds for undersampled MRI reconstruction and motion correction in MRI. Statistical tests on the SSIM values of MedGAN and our UP-GAN give p-values of 0.016 for PET to CT translation, 0.021 for undersampled MRI reconstruction, and 0.036 for MRI motion correction. As all p-values are \(<0.05\), the results are statistically significant.

Ablation Study. We study a variant of the model that does not utilize the estimated uncertainty maps as attention maps and observe that, without the uncertainty guidance, it performs inferior to UP-GAN, with a performance (SSIM/PSNR/MAE) of (0.87/25.4/40.7), (0.93/27.3/38.7), and (0.92/26.2/35.1) for PET to CT translation, undersampled MRI reconstruction, and MRI motion correction, respectively. The UP-GAN model leverages the uncertainty map to refine the predictions where the model is uncertain, which correlates with the regions where the translation is poor. The model without uncertainty-based guidance does not focus on these regions of the prediction and is unable to perform as well as UP-GAN.

Fig. 4. Quantitative results in the presence of limited labeled training data.

Table 1. Evaluation of various methods on three medical image translation tasks.

Evaluating Models with Weak Supervision. We evaluate all models for PET to CT synthesis while limiting the number of paired image samples used for training. We define five supervision levels corresponding to different amounts of cross-domain paired training slices: we train the recent state-of-the-art models with 5, 10, 15, 20, and 29 patients in the training stage, respectively. Figure 4 shows the performance of all models at the different supervision levels. We observe that our model with uncertainty guidance outperforms all baselines at full supervision (29 patients). Moreover, our model outperforms the baselines by an even larger margin with limited training data (< 29 patients). Under weak supervision, UP-GAN produces intermediate uncertainty maps with higher values than in the fully supervised case, but it can still focus on the highly uncertain regions; the current state-of-the-art models do not have access to such maps and therefore cannot leverage them to refine the predicted images.

5 Conclusion

In this work, we propose a new generic model for medical image translation using uncertainty-guided progressive GANs. We demonstrate how uncertainty can serve as an attention map in a progressive learning scheme. We show the efficacy of our method on three challenging medical image translation tasks: PET to CT translation, undersampled MRI reconstruction, and motion correction in MRI. Our method achieves state-of-the-art performance on these tasks. Moreover, it allows the quantification of uncertainty and shows better generalizability with smaller sample sizes than recent approaches.