
1 Introduction

Medical imaging, a powerful diagnostic and research tool that creates visual representations of anatomy, is widely used for disease diagnosis and surgical planning [2]. In current clinical practice, Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are the most commonly used modalities. Since CT and the various MR imaging modalities provide complementary information, effectively integrating these modalities can help physicians make better-informed decisions.

Since it is difficult and costly to obtain paired multi-modality images in clinical practice, there is a growing demand for multi-modality image translation methods to assist clinical diagnosis and treatment [17].

Existing works can be categorized into two types. The first is cross-modality medical image translation between two modalities, which scales poorly as the number of modalities grows [18, 19], since these methods must train \(n(n-1)\) generator models to learn all mappings between n modalities. The second is multi-modality image translation [1, 7, 16, 17]. In this category, some methods [7, 17] rely on paired data, which is hard to acquire in clinical reality. Other methods [1, 16] can learn from unpaired data; however, without prior knowledge they tend to deform the target area, as concluded by Zhang et al. [19]. As demonstrated in Fig. 1, state-of-the-art multi-modality image translation methods produce poor-quality local translations: the translated target area (e.g., the liver, outlined by red curves) is blurry, deformed, or perturbed with redundant, implausible textures. Compared with these methods, our method not only performs whole-image translation with competitive quality but also achieves significantly better local translation of the target area.

Fig. 1. Translation results (CT to T1w) of different methods. The target area (i.e., liver) is contoured in red. (Color figure online)

To address the above issues, we present a novel unified, general-purpose multi-modality medical image translation method named "Target-Aware Generative Adversarial Networks" (TarGAN). We incorporate target labels to enable the generator to focus on local translation of the target area. The generator has two input-output streams: one stream translates a whole image from the source modality to the target modality, and the other focuses on translating only the target area. In particular, we combine the cycle-consistency loss [21] and the backbone of StarGAN [1] to learn the generator, which enables our model to scale to an increasing number of modalities without relying on paired data. The untraceable constraint [20] is then employed to further improve the translation quality of synthetic images. To avoid the deformation of output images caused by the untraceable constraint, we construct a shape-consistency loss [3] with an auxiliary network, the shape controller. We further propose a novel crossing loss that makes the generator focus on the target area when translating the whole image to the target modality. Trained in an end-to-end fashion, TarGAN not only accomplishes multi-modality translation but also properly retains target area information in the synthetic images.

Overall, the Contributions of This Work Are: (1) We propose TarGAN to generate multi-modality medical images with high-quality local translation of target areas by integrating global and local mappings with a crossing loss. (2) We present qualitative and quantitative evaluations on multi-modality medical image translation tasks with the CHAOS2019 dataset [12], demonstrating our method's superiority over state-of-the-art methods. (3) We further use the synthetic images generated by TarGAN to improve the performance of a segmentation task, which indicates that these synthetic images help by enriching the information in the source images.

Fig. 2. The illustration of TarGAN. As in (b), TarGAN consists of four modules (G, S, \(D_x\), \(D_r\)). The generator G translates a source whole image \(x_s\) and a source target area image \(r_s\) into a target whole image \(x_t\) and a target area image \(r_t\). The detailed structure of G is shown in (a). The shape controller S preserves the invariance of anatomical structures. The discriminators \(D_x\) and \(D_r\) distinguish whether a whole image and its target area are real or fake and determine which modalities the source images come from.

2 Methods

2.1 Proposed Framework

Given an image \(x_{s}\) from source modality s and its corresponding target area label y, we obtain a target area image \(r_s\), which contains only the target area, via the masking operation \(y \cdot x_{s}\). Given any target modality t, our goal is to train a single generator G that translates any input image \(x_s\) of source modality s to the corresponding output image \(x_t\) of target modality t, and simultaneously translates the input target area image \(r_s\) of source modality s to the corresponding output target area image \(r_t\) of target modality t, denoted as \(G(x_s, r_s, t) \rightarrow (x_t, r_t)\). Figure 2 illustrates the architecture of TarGAN, which is composed of the four modules described below.
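For concreteness, the masking operation and the generator interface can be sketched in a few lines of PyTorch; the function names and tensor shapes (single-channel images, one-hot modality codes) are our assumptions for illustration, not the released implementation.

```python
import torch

def target_area_image(x_s: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """r_s = y * x_s: keep only the target area (y is a binary mask, same shape as x_s)."""
    return y * x_s

def translate(G, x_s, y, t_onehot):
    """G(x_s, r_s, t) -> (x_t, r_t): whole-image and target-area translation."""
    r_s = target_area_image(x_s, y)
    x_t, r_t = G(x_s, r_s, t_onehot)
    return x_t, r_t
```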

To achieve the aforementioned goal, we design a generator G with double input-output streams, consisting of a shared middle block and two encoder-decoder pairs. Combined with the shared middle block, each encoder-decoder pair translates an input image into an output image of the target modality t. One stream's input is the whole image \(x_{s}\); the other's input includes only the target area \(r_s\). The shared middle block is designed to implicitly enable G to focus on the target area during whole-image translation. Note that the target area label y of \(x_s\) is not available in the test phase, so the input block \(Encoder_r\) and output block \(Decoder_r\) are not used at that time.
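A minimal sketch of this weight-sharing pattern is given below, with plain convolutional blocks standing in for the U-Net encoder/decoder pairs used in practice (Sect. 3.1); the channel counts and the way the target modality t is injected (concatenated as label maps, as in StarGAN) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoStreamGenerator(nn.Module):
    """Sketch: two encoder-decoder pairs sharing one middle block."""
    def __init__(self, ch=64, n_modalities=3):
        super().__init__()
        in_ch = 1 + n_modalities  # image concatenated with target-modality label maps
        def enc():  # downsampling encoder (placeholder for the U-Net encoder)
            return nn.Sequential(nn.Conv2d(in_ch, ch, 4, 2, 1), nn.ReLU(inplace=True))
        def dec():  # upsampling decoder (placeholder for the U-Net decoder)
            return nn.Sequential(nn.ConvTranspose2d(ch, 1, 4, 2, 1), nn.Tanh())
        self.enc_x, self.dec_x = enc(), dec()   # whole-image stream
        self.enc_r, self.dec_r = enc(), dec()   # target-area stream (training only)
        self.shared = nn.Sequential(nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True))

    def forward(self, x_s, r_s, t_onehot):
        # Broadcast the target-modality code to spatial label maps and concatenate.
        t_map = t_onehot[:, :, None, None].expand(-1, -1, x_s.size(2), x_s.size(3))
        x_in = torch.cat([x_s, t_map], dim=1)
        r_in = torch.cat([r_s, t_map], dim=1)
        x_t = self.dec_x(self.shared(self.enc_x(x_in)))
        r_t = self.dec_r(self.shared(self.enc_r(r_in)))  # skipped at test time
        return x_t, r_t
```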

Given a synthetic image \(x_t\) or \(r_{t}\) from G, the shape controller S generates a binary mask representing the foreground area of the synthetic image.

Lastly, we use two discriminators, \(D_{x}\) and \(D_{r}\), corresponding to the two output streams of G. \(D_{x}\) distinguishes whether a whole image is real or fake and determines which modality the whole image comes from. Similarly, \(D_{r}\) distinguishes whether a target area image is real or fake and determines which modality the target area image comes from.
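A schematic of this two-headed design follows; the trunk is a stand-in for the PatchGAN backbone mentioned in Sect. 3.1, and the size of the modality label set (which, under the untraceable constraint, also contains "fake" labels) is left as a parameter, since the exact configuration is not specified here.

```python
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    """Sketch of D_x / D_r: a shared trunk with a real/fake (src) head and a
    modality-classification (cls) head. Layer sizes are illustrative only."""
    def __init__(self, n_cls, img_size=256, ch=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, ch, 4, 2, 1), nn.LeakyReLU(0.01),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.01),
        )
        self.src = nn.Conv2d(ch * 2, 1, 3, 1, 1)                 # patch-wise real/fake logits
        self.cls = nn.Conv2d(ch * 2, n_cls, img_size // 4)       # modality logits

    def forward(self, img):
        h = self.trunk(img)
        return self.src(h), self.cls(h).flatten(1)
```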

2.2 Training Objectives

Adversarial Loss. To minimize the difference between the distributions of generated images and real images, we define the adversarial loss as

$$\begin{aligned} \mathcal {L}_{adv\_x} &= \mathbb {E}_{x_s}[\log D_{src\_x}(x_s)] + \mathbb {E}_{x_t}[\log (1 - D_{src\_x}(x_t))], \\ \mathcal {L}_{adv\_r} &= \mathbb {E}_{r_s}[\log D_{src\_r}(r_s)] + \mathbb {E}_{r_t}[\log (1 - D_{src\_r}(r_t))]. \end{aligned}$$
(1)

Here, \(D_{src\_x}\) and \(D_{src\_r}\) represent the probability distributions over real and fake for input whole images and target area images, respectively.
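As a reference, Eq. (1) corresponds roughly to the standard GAN losses below, written with logits and the non-saturating generator form for stability; training ultimately replaces this with a Wasserstein loss with gradient penalty (Sect. 3.1), so this is only a conceptual sketch.

```python
import torch
import torch.nn.functional as F

def d_adv_loss(real_logits, fake_logits):
    """Discriminator side of Eq. (1): real images toward 1, fakes toward 0."""
    real_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real_loss + fake_loss

def g_adv_loss(fake_logits):
    """Generator side (non-saturating form): push D_src to classify fakes as real."""
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```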

Modality Classification Loss. To assign generated images to their target modality t, we impose a modality classification loss on G, \(D_{x}\) and \(D_{r}\). The loss consists of two terms: a modality classification loss on real images, denoted \(\mathcal {L}_{cls\_(x/r)}^r\), used to optimize \(D_x\) and \(D_r\), and a modality classification loss on fake images, denoted \(\mathcal {L}_{cls\_(x/r)}^f\), used to optimize G. In addition, to eliminate style features inherited from the source modalities in synthetic images, the untraceable constraint [20] is incorporated into \(\mathcal {L}_{cls\_(x/r)}^r\) as:

$$\begin{aligned} \mathcal {L}_{cls\_x}^{r} &= \mathbb {E}_{x_s,s}[-\log D_{cls\_x}(s|x_s)] + \lambda _u\, \mathbb {E}_{x_t,s'}[-\log D_{cls\_x}(s'|x_t)], \\ \mathcal {L}_{cls\_r}^{r} &= \mathbb {E}_{r_s,s}[-\log D_{cls\_r}(s|r_s)] + \lambda _u\, \mathbb {E}_{r_t,s'}[-\log D_{cls\_r}(s'|r_t)]. \end{aligned}$$
(2)

Here, \(D_{cls\_x}\) and \(D_{cls\_r}\) represent the probability distributions over modality labels given the input images. The label \(s'\) indicates that an input image is fake and was translated from source modality s [20]. Besides, we define \(\mathcal {L}_{cls\_(x/r)}^{f}\) as

$$\begin{aligned} \mathcal {L}_{cls\_x}^{f} = \mathbb {E}_{x_t,t}[-\log D_{cls\_x}(t|x_t)], \quad \mathcal {L}_{cls\_r}^{f} = \mathbb {E}_{r_t,t}[-\log D_{cls\_r}(t|r_t)]. \end{aligned}$$
(3)
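In code, Eqs. (2) and (3) amount to cross-entropy terms such as the sketch below, where the extra label s' ("fake, translated from modality s") reflects our reading of the untraceable constraint [20]; D_cls is assumed to return logits over the modality label set.

```python
import torch.nn.functional as F

def cls_loss_real(D_cls, x_real, s, x_fake, s_prime, lambda_u=0.01):
    """Eq. (2): classify real images by their modality s, plus the untraceable
    term that labels detached fakes as 'fake from modality s' (s_prime)."""
    return (F.cross_entropy(D_cls(x_real), s)
            + lambda_u * F.cross_entropy(D_cls(x_fake.detach()), s_prime))

def cls_loss_fake(D_cls, x_fake, t):
    """Eq. (3): push the generator to make fakes classified as target modality t."""
    return F.cross_entropy(D_cls(x_fake), t)
```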

Shape Consistency Loss. Since the untraceable constraint can deform anatomical structures in synthetic images, we counteract this effect by adding a shape consistency loss [3] on G via the shape controller S:

$$\begin{aligned} \mathcal {L}_{shape\_x} = \mathbb {E}_{x_t, b^x}[\Vert b^x - S(x_t)\Vert _2^2], \quad \mathcal {L}_{shape\_r} = \mathbb {E}_{r_t, b^r}[\Vert b^r - S(r_t)\Vert _2^2], \end{aligned}$$
(4)

where \(b^x\) and \(b^r\) are the binarizations (1 for foreground pixels, 0 otherwise) of \(x_s\) and \(r_s\). S constrains G to focus the multi-modality mapping on the content area.
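A sketch of Eq. (4), with a mean squared error standing in for the squared L2 norm and the foreground binarization passed in precomputed:

```python
import torch.nn.functional as F

def shape_consistency_loss(S, fake, b):
    """Eq. (4): the shape controller S should recover the foreground
    binarization b of the source image from the translated image."""
    return F.mse_loss(S(fake), b)
```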

Reconstruction Loss. To allow G to preserve the modality-invariant characteristics of the whole image \(x_s\) and its target area image \(r_s\), we employ a cycle consistency loss [21] as

$$\begin{aligned} \mathcal {L}_{rec\_x} = \mathbb {E}_{x_s, x_s'}[\Vert x_s - x_s'\Vert _1], \quad \mathcal {L}_{rec\_r} = \mathbb {E}_{r_s, r_s'}[\Vert r_s - r_s'\Vert _1]. \end{aligned}$$
(5)

Note that \(x_s'\) and \(r_s'\) are obtained from \(G(x_t, r_t, s)\). Given the synthetic image pair (\(x_t, r_t\)) and the source modality s, G tries to reconstruct the input images (\(x_s, r_s\)).
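A sketch of Eq. (5), reusing the double-stream generator interface from above:

```python
import torch.nn.functional as F

def reconstruction_loss(G, x_t, r_t, s_onehot, x_s, r_s):
    """Eq. (5): translate the synthetic pair back to source modality s
    and penalize the L1 distance to the original inputs."""
    x_rec, r_rec = G(x_t, r_t, s_onehot)
    return F.l1_loss(x_rec, x_s) + F.l1_loss(r_rec, r_s)
```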

Crossing Loss. To force G to focus on the target area when generating a whole image \(x_t\), we directly regularize G with a crossing loss defined as

$$\begin{aligned} \mathcal {L}_{cross} = \mathbb {E}_{x_t, r_t, y}[\Vert x_t \cdot y - r_t\Vert _1], \end{aligned}$$
(6)

where y is the target area label corresponding to \(x_s\). By minimizing the crossing loss, G jointly learns from the two input-output streams and shares information between them.
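Eq. (6) reduces to a masked L1 term, roughly:

```python
import torch.nn.functional as F

def crossing_loss(x_t, r_t, y):
    """Eq. (6): the target area cut out of the translated whole image
    should match the output of the target-area stream."""
    return F.l1_loss(x_t * y, r_t)
```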

Complete Objective. By combining the proposed losses together, our complete objective functions are as follows:

$$\begin{aligned} \mathcal {L}_{D_{(x/r)}} = - \mathcal {L}_{adv\_(x/r)} + \lambda _{cls}^r\,\mathcal {L}_{cls\_(x/r)}^r, \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{G} = \mathcal {L}_{adv\_(x/r)} + \lambda _{cls}^f\, \mathcal {L}_{cls\_(x/r)}^f + \lambda _{rec}\, \mathcal {L}_{rec\_(x/r)} + \lambda _{cross}\, \mathcal {L}_{cross}, \end{aligned}$$
(8)
$$\begin{aligned} \mathcal {L}_{G,S} = \mathcal {L}_{shape\_(x/r)}, \end{aligned}$$
(9)

where \(\lambda _{cls}^r\), \(\lambda _{cls}^f\), \(\lambda _{rec}\), \(\lambda _{cross}\) and \(\lambda _u\) (Eq. (2)) are hyperparameters that control the relative importance of each loss.
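Assembling the objectives might look like the sketch below; the default weights are the values reported in Sect. 3.1, and summing the x- and r-stream terms is our reading of the (x/r) notation.

```python
def discriminator_objective(adv, cls_r, lam_cls_r=1.0):
    """Eq. (7) for one stream. `adv` is the value of Eq. (1), which D maximizes;
    if using the minimizable d_adv_loss sketched earlier, omit the sign flip."""
    return -adv + lam_cls_r * cls_r

def generator_objective(parts, lam_cls_f=1.0, lam_rec=1.0, lam_cross=50.0):
    """Eq. (8), with the x- and r-stream terms of each loss summed (assumption).
    `parts` is a dict holding the loss tensors computed by the sketches above."""
    return (parts["adv_x"] + parts["adv_r"]
            + lam_cls_f * (parts["cls_f_x"] + parts["cls_f_r"])
            + lam_rec * (parts["rec_x"] + parts["rec_r"])
            + lam_cross * parts["cross"])

def shape_objective(shape_x, shape_r):
    """Eq. (9): the shape term, optimized jointly over G and S."""
    return shape_x + shape_r
```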

Table 1. Quantitative evaluations on synthetic images of different methods. (\(\uparrow \) denotes higher is better, while \(\downarrow \) denotes lower is better)

3 Experiments and Results

3.1 Settings

Dataset. We use 20 patients' data in each modality (CT, T1-weighted and T2-weighted) from the Combined Healthy Abdominal Organ Segmentation (CHAOS) Challenge [11]. Detailed imaging parameters are given in the supplementary material. We uniformly resize all slices to \(256 \times 256\). 50% of the data from each modality are randomly selected as training data, and the rest are used as test data. Because the CT scans have only liver labels, we set the liver as the target area.
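A schematic of the preprocessing and split (whether the 50/50 split is per slice or per patient is not stated; this sketch splits per slice, which is an assumption):

```python
import torch
import torch.nn.functional as F

def preprocess_and_split(slices, seed=0):
    """Resize each 2D slice tensor to 256x256 and split 50/50 into train/test."""
    g = torch.Generator().manual_seed(seed)
    resized = [F.interpolate(s[None, None], size=(256, 256), mode="bilinear",
                             align_corners=False)[0, 0] for s in slices]
    order = torch.randperm(len(resized), generator=g).tolist()
    half = len(resized) // 2
    train = [resized[i] for i in order[:half]]
    test = [resized[i] for i in order[half:]]
    return train, test
```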

Baseline Methods. We compare translation results against the state-of-the-art translation methods StarGAN [1], CSGAN [19] and ReMIC [16]. Note that we implement an unsupervised variant of ReMIC because of the lack of ground-truth images.

Target segmentation performance is also evaluated against the above methods. We train and test models using only real images of each modality, denoted as Single. We use the mean results of the two segmentation models per modality from CSGAN, and use the segmentation model \(G_s\) from ReMIC. For StarGAN and TarGAN, inspired by "image enrichment" [5], we extend each single modality to multiple modalities and concatenate them within each sample, e.g., [CT] \(\rightarrow \) [CT, synthetic T1w, synthetic T2w], as sketched below.
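The enrichment step for the CT case might look as follows; the channel ordering and the dummy target-area input (the r-stream is inactive at test time) are our assumptions.

```python
import torch

def enrich(ct_slice, G, t1_code, t2_code):
    """'Image enrichment': concatenate the real CT with its synthetic T1w/T2w
    translations as a multi-channel segmentation input, [CT] -> [CT, sT1w, sT2w]."""
    dummy_r = torch.zeros_like(ct_slice)       # target-area stream unused at test time
    syn_t1, _ = G(ct_slice, dummy_r, t1_code)
    syn_t2, _ = G(ct_slice, dummy_r, t2_code)
    return torch.cat([ct_slice, syn_t1, syn_t2], dim=1)  # (B, 3, H, W)
```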

Evaluation Metrics. In the translation tasks, due to the lack of ground-truth images, we cannot use common metrics such as PSNR and SSIM. Instead, we evaluate both the visual quality and the structural integrity of the target area in generated images using the Fréchet inception distance (FID) [6] and the segmentation score (S-score) [19]. We compute FID and S-score for each modality and report their average values. Details of these metrics are given in the supplementary material.

In the segmentation tasks, the Dice coefficient (DICE) and relative absolute volume difference (RAVD) are used as metrics. We compute each metric on every modality and report the average values and standard deviations.
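For reference, the two segmentation metrics on binary masks are, roughly:

```python
import torch

def dice(pred, target, eps=1e-6):
    """Dice coefficient between binary masks (float tensors of 0s and 1s)."""
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def ravd(pred, target, eps=1e-6):
    """Relative absolute volume difference: |V_pred - V_ref| / V_ref."""
    return (pred.sum() - target.sum()).abs() / (target.sum() + eps)
```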

Implementation Details. We use U-Net [15] as the backbone of G and S. In G, only half of the channels are used for each skip connection. For \(D_x\) and \(D_r\), we implement the backbone with PatchGAN [9]. Details of the above networks are included in the supplementary material. All liver segmentation experiments are conducted with nnU-Net [8], except for CSGAN and ReMIC.

To stabilize the training process, we adopt the Wasserstein GAN loss with a gradient penalty [4, 14] using \(\lambda _{gp} = 10\), and the two-timescale update rule (TTUR) [6] for G and D. The learning rates for G and S are set to \(10^{-4}\), while that of D is set to \(3 \times 10^{-4}\). We set \(\lambda _{cls}^r=1\), \(\lambda _{cls}^f=1\), \(\lambda _{rec} = 1\), \(\lambda _{cross} = 50\) and \(\lambda _u = 0.01\). The batch size and number of training epochs are set to 4 and 50, respectively. We use the Adam optimizer [13] with momentum parameters \(\beta _1 = 0.5\) and \(\beta _2 = 0.9\). All images are normalized to [\(-1, 1\)] prior to training and testing. We use an exponential moving average over the parameters of G [10] during testing, with a decay of 0.999. Our implementation is trained on an NVIDIA RTX 2080 Ti with PyTorch.
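The training setup described above can be sketched as follows (gradient penalty, TTUR optimizers, and the EMA of G's weights); this mirrors the stated hyperparameters but is not the released training code.

```python
import torch

def gradient_penalty(D_src, real, fake):
    """WGAN-GP penalty [4, 14] on interpolates between real and fake images."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake.detach()).requires_grad_(True)
    out = D_src(x_hat)
    grads = torch.autograd.grad(out, x_hat, grad_outputs=torch.ones_like(out),
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def make_optimizers(G, S, D_x, D_r):
    """TTUR: D uses a 3x larger learning rate than G and S; Adam betas as reported."""
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
    opt_s = torch.optim.Adam(S.parameters(), lr=1e-4, betas=(0.5, 0.9))
    opt_d = torch.optim.Adam(list(D_x.parameters()) + list(D_r.parameters()),
                             lr=3e-4, betas=(0.5, 0.9))
    return opt_g, opt_s, opt_d

def ema_update(g_ema, g, decay=0.999):
    """Exponential moving average over G's parameters, used at test time."""
    with torch.no_grad():
        for p_ema, p in zip(g_ema.parameters(), g.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)
```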

Fig. 3. Multi-modality medical image translation results. Red boxes highlight redundant textures, and blue boxes indicate deformed structures. (Color figure online)

3.2 Results and Analyses

Image Translation. Figure 3 shows qualitative results for each pair of modalities. As shown, StarGAN fails to translate images from CT to T1w and produces many artifacts in MRI-to-CT translation. CSGAN sometimes adds redundant textures (marked by the red boxes) in the target area while retaining the shape of the target. ReMIC tends to generate relatively realistic synthetic images but deforms the structure of the target area in most cases (marked by the blue boxes). Compared with the above methods, TarGAN generates translation results of higher visual quality and properly preserves the target structures. Facilitated by the proposed crossing loss, TarGAN jointly learns the mappings of the target area and the whole image among different modalities, which further makes G focus on the target areas and improves their quality. Furthermore, as shown in Table 1, TarGAN outperforms all baselines in terms of FID and S-score, which suggests that TarGAN produces the most realistic medical images and that the target area integrity of its synthetic images is significantly better.

Table 2. Liver segmentation results (mean ± standard deviation) on different medical modalities.

Liver Segmentation. The quantitative segmentation results are shown in Table 2. Our method achieves better performance than all other methods on most metrics. This suggests that TarGAN can not only generate realistic images for every modality but also properly retain liver structure in synthetic images. The high-quality local translation of the target areas plays a key role in the improvement of liver segmentation performance. By jointly learning from real and synthetic images, the segmentation models can exploit more information about the liver areas within each sample.

Ablation Test. We conduct an ablation test to validate the effectiveness of different components of TarGAN in terms of preserving target area information. For ease of presentation, we denote the shape controller, the target area translation mapping and the crossing loss as S, T and C, respectively. As shown in Table 3, TarGAN without (w/o) S, T, C is essentially StarGAN, differing only in implementation. The proposed crossing loss plays a key role in TarGAN, increasing the mean S-score from \(51.03\%\) (TarGAN w/o C) to \(64.18\%\).

Table 3. Ablation study on different components of TarGAN. Note that TarGAN w/o S, T and TarGAN w/o T do not exist, since T is a prerequisite of C.

4 Conclusion

In this paper, we propose a novel general-purpose method, TarGAN, to address two main challenges in multi-modality medical image translation: learning multi-modality medical image translation without relying on paired data, and improving the quality of local translation of the target area. A novel translation mapping mechanism is introduced to enhance target area quality while generating the whole image. Additionally, by using the shape controller to alleviate the deformation problem caused by the untraceable constraint and by incorporating a novel crossing loss into the generator G, TarGAN addresses both challenges within a unified framework. Both quantitative and qualitative evaluations show the superiority of TarGAN over state-of-the-art methods. We further conduct a segmentation task to demonstrate the effectiveness of the synthetic images generated by TarGAN in a real application.