1 Introduction

In medical imaging, obtaining diagnostic images from multiple modalities is necessary for an accurate and comprehensive disease diagnosis. For example, T1-weighted (T1) brain images clearly differentiate gray and white matter tissues, whereas T2-weighted (T2) images differentiate fluid from cortical tissue. By leveraging the information provided by both of these image modalities, we can gain a more in-depth and complete picture for diagnosis. However, obtaining both images separately is often costly and time-consuming, and the images may be corrupted by noise and artifacts. Therefore, cross-modality synthesis is a promising application to improve the clinical feasibility and utility of multi-contrast MRI. Image-to-image translation has recently gained attention in the medical imaging community, where the task is to estimate the corresponding image in a target domain from a given source domain image of the same subject. Image-to-image translation methods can generally be divided into two categories, Generative Adversarial Networks (GANs) and flow-based generative networks, summarized as follows:

Generative Adversarial Networks. GANs are a class of latent variable generative models that explicitly define the generator as a deterministic mapping. The deterministic mapping represents an image as a point in the latent space without accounting for its feature ambiguity. Several GAN-based models have been used to explore image-to-image translation in the literature [2, 3, 14, 16]. For example, Zhu et al. [16] proposed the cycleGAN method for mapping between unpaired domains by using a cycle-consistency constraint on the solutions provided by the generative network. Bansal et al. [2] proposed RecycleGAN, which exploits temporal information by learning to predict the next frame for video generation. Chen et al. [3] proposed a 3D cycleGAN network to learn the mapping between CT and MRI. The drawbacks of the 3D cycleGAN are its memory consumption and its loss of global information due to working on small patch sizes.

Flow-Based Generative Networks. Flow-based generative networks are a class of latent variable generative models that explicitly define the generator as an invertible mapping. The invertible mapping provides a distributional estimate of features in the latent space. Recently, many efforts making use of flow-based generative networks have been proposed to translate between two unpaired domains [4, 5, 7, 10, 12]. For example, Grover et al. [5] introduced a flow-to-flow (alignflow) network for unpaired image-to-image translation. Sun et al. [12] introduced a conditional dual flow-based invertible network to translate between positron emission tomography (PET) and magnetic resonance imaging (MRI) images. Owing to their invertibility, flow-based methods can ensure exact cycle consistency when translating from a source domain to the target and returning to the source domain, without any further loss functions.

Limitations of Existing Methods and Our Contributions. The primary drawback of the cycleGAN model is that it cannot perform a one-to-one mapping for accurate and unique unpaired image translation, and it generates biased image translations in the inverse mapping [11]. Different from GAN-based methods, flow-based methods guarantee exact cycle consistency in mapping data points from a source domain to the target and returning to the source domain. However, flow-based methods do not take into account the temporal information between consecutive slices. To address this problem, we propose a new method that inherits the merits of the flow-based approach and exploits temporal information between consecutive slices. Our approach provides additional constraints on the optimization for transforming one domain into another. To capture temporal information, we employ a deformation field between consecutive slices learned by a convolutional neural network. In our proposed approach, the deformation field serves as guidance to keep slices realistic and consistent across the translation.

2 Related Work

2.1 Cycle-Consistent Adversarial Networks (cycleGAN)

Let \(\{x_i\}_{i=1}^N\) and \(\{y_i\}_{i=1}^M\) be unpaired data samples from two domains, i.e., the source domain X and the target domain Y, respectively. Denote by D and G a discriminator network and a generator network. The cycleGAN model [16] solves unpaired image-to-image translation between these two domains by estimating two independent mapping functions \(G_{X \rightarrow Y}: X\rightarrow Y\) and \(G_{Y \rightarrow X}: Y\rightarrow X\). The two mapping functions \(G_{X \rightarrow Y}\) and \(G_{Y \rightarrow X}\), implemented as neural networks, are trained to fool the discriminators \(D_Y\) and \(D_X\), respectively. The discriminators \(D_X\) and \(D_Y\) encourage the translated images to be similar to the real images. Hence, the cycleGAN loss is defined as:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{cycleGAN}(G_{X \rightarrow Y},G_{Y \rightarrow X},D_X,D_Y) = \mathcal {L}_{GAN}(G_{X \rightarrow Y},D_Y) +\mathcal {L}_{GAN}(G_{Y \rightarrow X},D_X) \\&\quad \quad \quad \quad \quad \quad \quad \quad +\lambda \mathcal {L}_{cycle}(G_{X \rightarrow Y},G_{Y \rightarrow X}) +\beta \mathcal {L}_{identity}(G_{X \rightarrow Y},G_{Y \rightarrow X}) \end{aligned} \end{aligned}$$
(1)

where \(\mathcal {L}_{GAN}\) is the GAN loss for the discriminator network D [16]. \(\mathcal {L}_{cycle}\) is a cycle consistency loss that guarantees that a translated image can be brought back to the original image by the generator network G. For example, the cycle consistency loss for data translated from \(X \rightarrow Y\) via \(G_{X \rightarrow Y}\) and mapped back to the original domain X via \(G_{Y \rightarrow X}\) is defined as:

$$\begin{aligned} \mathcal {L}_{cycle}(G_{X \rightarrow Y},G_{Y \rightarrow X})=\left\Vert G_{Y \rightarrow X}(G_{X \rightarrow Y} (x)) - x \right\Vert _{1} \end{aligned}$$
(2)

The identity loss \(\mathcal {L}_{identity}\) regularizes the generator to be near an identity mapping when real samples of the target domain are given as input to the generator. The hyperparameters \(\lambda \) and \(\beta \) control the contributions of these two terms.
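For concreteness, the cycleGAN objective can be sketched in a few lines of PyTorch. This is only a minimal sketch; the least-squares GAN formulation and the default weights `lam` and `beta` below are illustrative assumptions rather than values prescribed by [16]:

```python
import torch
import torch.nn.functional as F

def cyclegan_loss(G_xy, G_yx, D_x, D_y, x, y, lam=10.0, beta=5.0):
    """Minimal sketch of Eq. (1) from the generators' point of view."""
    fake_y, fake_x = G_xy(x), G_yx(y)
    pred_y, pred_x = D_y(fake_y), D_x(fake_x)

    # Adversarial terms: each generator tries to make its fakes be scored as real.
    loss_gan = F.mse_loss(pred_y, torch.ones_like(pred_y)) + \
               F.mse_loss(pred_x, torch.ones_like(pred_x))

    # Cycle-consistency term, Eq. (2): X -> Y -> X should recover x, and vice versa.
    loss_cycle = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)

    # Identity term: a generator fed a real target-domain image should barely change it.
    loss_idt = F.l1_loss(G_xy(y), y) + F.l1_loss(G_yx(x), x)

    return loss_gan + lam * loss_cycle + beta * loss_idt
```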

2.2 Flow-Based Generative Models

Flow-based generative models are a class of latent variable generative models that explicitly define the generator as an invertible mapping \(h: Z \rightarrow X\) between a set of latent variables Z and a set of observed variables X. Let \(p_X\) and \(p_Z\) denote the marginal densities given by the model over X and Z, respectively. Using the change-of-variables formula, these marginal densities are related as

$$\begin{aligned} p_{X}(x) = p_{Z}(z) \bigg |\det \frac{\partial h^{-1}}{\partial X}\bigg |_{X=x} \end{aligned}$$
(3)

where \(z=h^{-1}(x)\) because of the invertibility constraint. In particular, we use a standard multivariate Gaussian prior \(p_{Z}(z) = \mathcal {N}(z; \mathbf{0} , \mathbf{I} )\). Unlike adversarial training, flow models trained with maximum likelihood estimation (MLE) explicitly require a prior \(p_{Z}(z)\) with a tractable density to evaluate model likelihoods using the change-of-variables formula in Eq. (3).
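As a sketch of this training procedure, the negative log-likelihood of a batch can be computed directly from Eq. (3); here `flow.inverse` is a hypothetical invertible module (our assumption, not a library API) returning both \(z = h^{-1}(x)\) and the log-determinant of the Jacobian of the inverse:

```python
import math
import torch

def flow_nll(x, flow):
    """Negative log-likelihood via the change-of-variables formula, Eq. (3)."""
    z, log_det = flow.inverse(x)          # z = h^{-1}(x) and log|det(dh^{-1}/dx)|
    d = z[0].numel()                      # dimensionality of one sample
    # log p_Z(z) under the standard Gaussian prior N(0, I).
    log_pz = -0.5 * (z ** 2).flatten(1).sum(dim=1) - 0.5 * d * math.log(2 * math.pi)
    log_px = log_pz + log_det             # change of variables, Eq. (3)
    return -log_px.mean()                 # minimized during MLE training
```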

Building on the flow-based method of [4], Grover et al. [5] proposed the alignflow method for unpaired image-to-image translation. In this method, the mapping between two domains \(X \rightarrow Y\) is represented through a shared latent space Z by the composition of two invertible mappings [5]:

$$\begin{aligned} G_{X \rightarrow Y} = G_{Z \rightarrow Y} \circ G_{X \rightarrow Z}, \quad \quad \quad G_{Y \rightarrow X} = G_{Z \rightarrow X} \circ G_{Y \rightarrow Z} \end{aligned}$$
(4)

where \(G_{X \rightarrow Z}= G_{Z \rightarrow X}^{-1}\) and \(G_{Y \rightarrow Z}= G_{Z \rightarrow Y}^{-1}\). Since the composition of invertible mappings is itself invertible, both \(G_{X \rightarrow Y}\) and \(G_{Y \rightarrow X}\) are invertible [5]. Moreover, \(G_{X \rightarrow Y}^{-1} = G_{Y \rightarrow X}\). Thus, Eq. (2) can be rewritten as

$$\begin{aligned} \begin{aligned} \mathcal {L}_{cycle}(G_{X \rightarrow Y},G_{Y \rightarrow X})&=\left\Vert G_{Y \rightarrow X}(G_{X \rightarrow Y} (x)) - x \right\Vert _{1} \\&=\left\Vert G_{X \rightarrow Y}^{-1}(G_{X \rightarrow Y} (x)) - x \right\Vert _{1} = 0 \end{aligned} \end{aligned}$$
(5)

where the composition \(G_{X \rightarrow Y}^{-1} \circ G_{X \rightarrow Y}\) is the identity mapping.
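The shared-latent-space construction of Eqs. (4) and (5) can be sketched as follows, assuming hypothetical invertible blocks that expose paired forward/inverse methods (as normalizing-flow layers do); `inverse(forward(x))` then recovers x exactly, so no cycle loss is needed:

```python
import torch.nn as nn

class ComposedMap(nn.Module):
    """Sketch of Eq. (4): G_{X->Y} = G_{Z->Y} o G_{X->Z}."""
    def __init__(self, g_zx, g_zy):
        super().__init__()
        self.g_zx, self.g_zy = g_zx, g_zy   # invertible maps Z -> X and Z -> Y

    def forward(self, x):                    # X -> Z -> Y
        z = self.g_zx.inverse(x)             # G_{X->Z} = G_{Z->X}^{-1}
        return self.g_zy(z)

    def inverse(self, y):                    # Y -> Z -> X, exact by construction
        z = self.g_zy.inverse(y)
        return self.g_zx(z)
```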

Equation (5) implies that flow-based methods can guarantee exact cycle consistency when mapping from a source domain to the target and returning to the source domain, without additional loss functions. Hence, the alignflow objective is defined as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{flow}(G_{X \rightarrow Y},G_{Y \rightarrow X},D_X,D_Y)&= \mathcal {L}_{GAN}(G_{X \rightarrow Y},D_Y) +\mathcal {L}_{GAN}(G_{Y \rightarrow X},D_X) \\&- \lambda _{X}\mathcal {L}_{MLE}(G_{Z \rightarrow X})- \lambda _{Y}\mathcal {L}_{MLE}(G_{Z \rightarrow Y}) \end{aligned} \end{aligned}$$
(6)

where \(\lambda _{X}, \lambda _{Y} \ge 0\) are hyperparameters that control the importance of the MLE terms for domains X and Y, respectively.

Fig. 1. A comparison between (a) the cycleGAN and (b) the alignflow generative model. Double-headed arrows denote an invertible mapping.

Figure 1 illustrates the difference between the cycleGAN and alignflow methods. Unlike cycleGAN, the alignflow method is a fully invertible architecture that guarantees cycle-consistent translation between two unpaired domains without an additional \(\mathcal {L}_{cycle}\) term.

3 Proposed Method

Our motivation is to learn a mapping between unpaired images from different domains by leveraging the temporal information between consecutive slices. We use the temporal information to constrain the mapping between the two domains to be consistent. Our method extends the alignflow method [5] by making use of temporal information between consecutive slices.

Fig. 2. Deformation guided temporal constraints for domain Y.

3.1 Deformation Guided Temporal Constraints

To obtain the displacement between consecutive slices, we use an unsupervised registration network [1] to learn a deformation field \(\phi \) between a slice \(x_t\) and its consecutive slice \(x_k\). The deformation field \(\phi \) can be obtained with a convolutional neural network (CNN) [1] by minimizing the loss function

$$\begin{aligned} \mathcal {L}(\phi ) = \left\Vert x_t - x_k \circ \phi \right\Vert _{2}^{2} + \alpha \left\Vert \nabla \phi \right\Vert _{2}^{2} \end{aligned}$$

(7)

where \(x_k \circ \phi \) denotes the spatial transformation (warping) of \(x_k\) by \(\phi \). The first term encourages the slice \(x_t\) and the warped consecutive slice to be close. The second term imposes smoothness regularization on \(\phi (.)\), weighted by \(\alpha \).
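A sketch of this registration loss is shown below. The bilinear warping via `grid_sample` follows the spatial-transformer approach of [1]; the assumption that displacements are expressed in normalized \([-1, 1]\) coordinates with channel order (x, y) is ours:

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp img (N,C,H,W) by a dense displacement field flow (N,2,H,W)."""
    n, _, h, w = img.shape
    # Identity sampling grid in [-1, 1], the convention of grid_sample.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    grid = base.to(img.device) + flow.permute(0, 2, 3, 1)  # identity grid + flow
    return F.grid_sample(img, grid, align_corners=True)

def registration_loss(x_t, x_k, flow, alpha=1.0):
    """Eq. (7): similarity of x_t and the warped x_k, plus smoothness on phi."""
    sim = F.mse_loss(warp(x_k, flow), x_t)
    # Squared first differences approximate the gradient penalty on the field.
    smooth = ((flow[:, :, 1:, :] - flow[:, :, :-1, :]) ** 2).mean() + \
             ((flow[:, :, :, 1:] - flow[:, :, :, :-1]) ** 2).mean()
    return sim + alpha * smooth
```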

To guarantee the consistency of the image translation, an \(\mathcal {L}_1\) loss is used to measure the difference between the warped fake image of the consecutive slice \(x_k\) and the fake image of the slice \(x_t\). We define the temporal consistency loss functions for the mappings \(X \rightarrow Y\) and \(Y \rightarrow X\) as:

$$\begin{aligned} \mathcal {L}_{reg}(X, G_{X \rightarrow Y})&= \left\Vert G_{X \rightarrow Y}(x_k) \circ \phi - G_{X \rightarrow Y}(x_t) \right\Vert _{1} \\ \mathcal {L}_{reg}(Y, G_{Y \rightarrow X})&= \left\Vert G_{Y \rightarrow X}(y_k) \circ \phi - G_{Y \rightarrow X}(y_t) \right\Vert _{1} \end{aligned}$$

(8)

Figure 2 illustrates an example of image-to-image translation from domain \(X \rightarrow Y\) using temporal constraints. Let \(x_t, x_{t+1}, x_{t+2}\) be consecutive slices of real images in the source domain X. A mapping function \(G_{X \rightarrow Y}\) generates the fake images \(y_t, y_{t+1}, y_{t+2}\) in the target domain Y. In the source domain, we can learn displacement fields \(\phi _t(.), \phi _{t+2}(.)\) between \((x_t, x_{t+1})\) and \((x_{t+2}, x_{t+1})\). To constrain the consistency of the mapping from \(X \rightarrow Y\), we minimize the distance (i) between \(y_{t+1}\) and the fake image \(y_t\) warped by \(\phi _t\), and (ii) between \(y_{t+1}\) and the fake image \(y_{t+2}\) warped by \(\phi _{t+2}\).
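Given the `warp` function from the previous sketch, the temporal consistency term of Eq. (8) reduces to one line per domain. Detaching the deformation field, so that it acts purely as fixed guidance rather than being updated through this loss, is our assumption; the paper does not state this detail:

```python
import torch.nn.functional as F

def temporal_consistency_loss(G, x_t, x_k, flow):
    """Sketch of Eq. (8): warping the fake of x_k should reproduce the fake of x_t."""
    fake_t, fake_k = G(x_t), G(x_k)
    return F.l1_loss(warp(fake_k, flow.detach()), fake_t)
```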

Fig. 3. Our flow-based deformation guidance approach for unpaired image-to-image translation.

3.2 Network Diagram

Figure 3 illustrates the proposed network diagram for unpaired image-to-image translation. Our network architecture inherits the advantages of the invertible property of alignflow [5]. During training, we add two additional registration networks, \(Reg_X\) and \(Reg_Y\), one per domain, to learn the deformation fields \(\phi (.)\). These additional networks are used only at training time, so they increase neither the model complexity nor the inference time compared with the baseline flow-based method. The temporal constraints via the \(\mathcal {L}_{reg}(.)\) losses ensure that mappings of consecutive slices in the source domain are consistent in the target domain. Finally, our objective function is defined as:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{flow\_reg}(G_{X \rightarrow Y},G_{Y \rightarrow X},D_X,D_Y, \phi )= \mathcal {L}_{flow}(G_{X \rightarrow Y},G_{Y \rightarrow X},D_X,D_Y) \\&\quad \quad \quad \quad +\lambda _{1}\mathcal {L}_{reg}(X, G_{X\rightarrow Y})+\lambda _{2}\mathcal {L}_{reg}(Y, G_{Y\rightarrow X}) + \beta _{1}\mathcal {L}_X(\phi ) + \beta _{2}\mathcal {L}_Y(\phi )\\&\quad \quad \quad \quad +\gamma _1 \mathcal {L}_{TV}(X) + \gamma _2 \mathcal {L}_{TV}(Y) \end{aligned} \end{aligned}$$
(9)

where \(\lambda _{1}\), \(\lambda _{2}\), \(\beta _{1}\), and \(\beta _{2}\) control the relative importance of the two temporal consistency losses and the two registration losses. \(\mathcal {L}_{TV}\) denotes a total variation (TV) loss that imposes spatial smoothness by measuring the horizontal and vertical gradients of the generated images [15]. The TV losses are weighted by \(\gamma _1\) and \(\gamma _2\).
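The TV term, for instance, is straightforward; the following is a minimal sketch using mean absolute spatial gradients (an L1 variant; the exact norm used in [15] may differ):

```python
def tv_loss(img):
    """Total variation of a generated batch (N,C,H,W)."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()   # vertical gradients
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()   # horizontal gradients
    return dh + dw
```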

4 Experimental Results

4.1 Datasets and Training

We used common medical datasets to measure the robustness of our method against existing methods: cycleGAN [16], recycleGAN [2], cycleflow [11], and alignflow [5]. cycleGAN [16] is an unpaired image-to-image translation method that works at the single-slice level. RecycleGAN [2] builds upon cycleGAN and adds a temporal predictor trained to predict a future slice from a set of previous consecutive slices. cycleflow [11] is a flow-based method, but it ignores the shared latent space Z (it maps directly from \(X \rightarrow Y\) instead of \(X \rightarrow Z \rightarrow Y\) as in alignflow). The synthetic image from each method was quantitatively compared with the real paired image using the following performance metrics: mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM).
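These three metrics can be computed with standard scikit-image routines; the sketch below assumes the synthetic and real slices are 2D arrays sharing a common intensity range:

```python
from skimage.metrics import (mean_squared_error, peak_signal_noise_ratio,
                             structural_similarity)

def evaluate(fake, real):
    """Compare a synthetic slice against its real paired slice."""
    data_range = real.max() - real.min()
    return {
        "MSE":  mean_squared_error(real, fake),
        "PSNR": peak_signal_noise_ratio(real, fake, data_range=data_range),
        "SSIM": structural_similarity(real, fake, data_range=data_range),
    }
```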

Human Connectome Project (HCP): provided by the Human Connectome Project [13]. We used T1 as the source domain and T2 as the target domain. We extracted axial views of the T1/T2 volumes as 2D images and split them into 1150 images for the training set and 500 images for the testing set.

MRBrainS13: [8] contains 15 subjects for training and validation and 6 subjects for testing. For each subject, two modalities are available, T1-weighted and T2-FLAIR, with an image size of \(48 \times 240 \times 240\). We extracted the dataset into 2D images, with 450 images for training and 150 images for testing.

Brats2019: [9] includes 210 HGG scans and 75 LGG scans, each with a dimension of \(240 \times 240 \times 155\). Each scan is extracted into 2D images; we use 770 images for training and 250 images for testing.

Training. All networks were implemented using the PyTorch framework and trained on a 12 GB GPU. Input images were resized to \(128 \times 128\) and normalized to \([-1, 1]\). We used axial slices (10 slices around the middle slice) from each subject. The Adam optimizer with a batch size of two was used to train the networks. The initial learning rate was set to 0.0002 and was decreased by a factor of ten every 20 epochs. We trained each model for 100 epochs. The balance weights were set as \(\lambda _X=\lambda _Y=10^{-5}, \lambda = \lambda _{1}=\lambda _{2}=10, \beta _1=\beta _2=1,\gamma _1=\gamma _2=1\). The discriminator network is a \(70 \times 70\) PatchGAN [6]. For the alignflow network [5], we set the number of scales to 1 and the number of blocks to 3. We use two consecutive slices (the preceding and following slices) to learn the temporal constraint.
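The optimization schedule above can be sketched as follows; the single joint optimizer is a simplification (generator and discriminator updates alternate in practice), and `flow_reg_loss` is a hypothetical stand-in for the full Eq. (9) objective:

```python
import itertools
import torch

def train(G, D_x, D_y, loader, flow_reg_loss, epochs=100):
    params = itertools.chain(G.parameters(), D_x.parameters(), D_y.parameters())
    opt = torch.optim.Adam(params, lr=2e-4)                 # initial learning rate 0.0002
    # Decrease the learning rate by a factor of ten every 20 epochs.
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.1)
    for _ in range(epochs):                                 # 100 epochs total
        for x, y in loader:                                 # batches of two 128x128 slices
            opt.zero_grad()
            loss = flow_reg_loss(x, y)
            loss.backward()
            opt.step()
        sched.step()
```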

Fig. 4. A visualization of results on different datasets: (a) source image, (b) target image, and synthetic images generated by (c) cycleGAN, (d) recycleGAN, (e) cycleflow, (f) alignflow, and (g) our method. Our method provides a better boundary in the tumor regions (red arrows in the fifth row) than the existing methods (Color figure online).

4.2 Performance Evaluation

Qualitative Evaluation. Figure 4 illustrates image translation on different datasets. The proposed method (in the last column) produces better synthetic images, resulting in better MSE, SSIM, and PSNR scores. For example, given the available source T1 image as input, the synthetic T2 image from our method shows a qualitatively sharper tumor boundary (indicated by the red arrows in the fifth row) than those from the existing methods.

Quantitative Evaluation. Table 1 reports the MSE, PSNR, and SSIM values of the proposed method and the existing methods. From the table, it is clear that the flow-based methods (cycleflow [11], alignflow [5], and our method) provide results competitive with the GAN-based methods (cycleGAN and recycleGAN). By adding temporal constraints, the proposed network outperforms the baseline method (alignflow) on all performance metrics. Different from recycleGAN, which exploits temporal information via future-slice prediction from consecutive slices, the proposed method measures pixel-wise temporal consistency by directly warping the synthetic slices with the deformation fields of the consecutive source slices, and thus achieves better performance. This indicates the effectiveness of the proposed method for unpaired image-to-image translation in medical imaging.

Table 1. Comparison of the proposed method against other image-to-image translation methods on the HCP, MRBrainS13, and Brats2019 datasets.

5 Conclusion

We presented an effective method for image-to-image translation that combines a flow-based method with deformation information, allowing the proposed method to exploit the temporal information between consecutive slices to constrain the translated images. We showed that the proposed method provides good translated images, yielding better MSE, PSNR, and SSIM on various MRI datasets. Although our network is fully invertible, it requires more memory than GAN-based methods (such as cycleGAN and recycleGAN).