
1 Introduction

Magnetic Resonance Imaging (MRI) has become prominent in neurology with the use of high-field scanners, given that MRI contrast agents are less likely to cause an allergic reaction than the iodine-based substances used in X-ray or CT scans [24]. However, due to modality corruption, incorrect machine settings, allergies to specific contrast agents, and limited available time [7], a complete set of MRI sequences providing rich information for clinical diagnosis and therapy often cannot be obtained. In this regard, the development of cross-modality or cross-protocol MRI synthesis techniques is important to homogenize and “repair” such real-world data collections via efficient data infilling and re-synthesis, and to make them accessible to algorithms that require complete data sets as input [2, 10, 23, 30].

Recently, a large number of algorithms for medical image synthesis have been proposed with the rapid growth of deep learning techniques [6, 11, 27, 37]. Among them, generative adversarial networks (GANs), with their ability to recover an unprecedented level of image realism [15], have achieved significant advances in medical image synthesis. For example, Ea-GANs [32] incorporated edge information and focused on the textural details of image content structure for cross-modality MRI synthesis. SA-GANs [33] added a sample-adaptive path to learn the relationship of each sample with its neighboring training samples. MM-GAN [5] and mustGAN [34] were designed for multi-modal MRI synthesis with structures capable of fusing latent representations of each input modality. However, these state-of-the-art methods [5, 7, 10, 13, 32,33,34, 36, 38] fixed the network architecture across different input modality combinations (e.g., T1, T1-weighted, Flair, T1\(+\)Flair, etc.) and ignored the mapping uniqueness between each source-target domain pair, and therefore could not reach the optimal solution for all situations using the same network structure.

Inspired by the great potential of neural architecture search (NAS) in the computer vision field [8, 17,18,19, 25, 31, 35, 40], we explore NAS to automatically find an optimal network with lower computational cost and fewer parameters for different input modalities. Searching for a dedicated MRI synthesizer is especially promising because the nature of the problem, using one network for many synthesis tasks, caters to the NAS principle of constructing one search architecture for multiple jobs. However, how to search the architecture of generative networks according to the different input modalities given to the synthesizer remains unexplored. In this paper, we aim to adaptively optimize and construct a neural architecture that is capable of learning how to extract and merge features according to different input modalities. Specifically, we adopt a GAN-based structure as the backbone of NAS, where the generator of the GAN is searched by gradient-based NAS from a multi-scale module-based search space. The main contributions of our AutoGAN-Synthesizer are as follows: (1) Aiming at the recovery of realistic texture while constraining model complexity, we propose a GAN-based perceptual searching loss that jointly incorporates the content loss and model complexity. (2) To incorporate richer priors for MRI synthesis, we exploit MRI K-space knowledge, which contains low-frequency (e.g., contrast and brightness) and high-frequency (e.g., edges and content details) information, to guide the NAS network in extracting and merging features. (3) Considering that the low- and high-resolution branches of multi-scale networks can capture the global structure and local details respectively, we use a novel multi-scale module-based search space specifically designed for multi-resolution fusion. The module-based searching setting also reduces search time while maintaining performance. (4) Finally, our searching strategy can produce a light-weight network with 6.31 Mb of parameters from the module-based search space in only 12 GPU hours and achieve state-of-the-art performance. To the best of our knowledge, this is the first work to explore AutoML for cross-modality MRI synthesis tasks.

2 Proposed Method

Fig. 1. (a) Overall architecture of our proposed AutoGAN-Synthesizer. The AutoGAN-Synthesizer contains two parts: 1) a NAS-based generator that adaptively builds up an architecture based on the input modalities (\(X_{1+k}\), \(X_{2+k}\) or \(X_{3+k}\)), where \(X_{i+k}\) represents the input with i modalities and the corresponding K-space features (denoted as k); 2) a discriminator that distinguishes between the synthesized and the real modality. (b) Generator search space consisting of three modules that capture and fuse the detailed information and global features from different multi-scale branches: the horizontal module, the extension module and the composite module. (c) An example of an optimized generator including the three proposed modules.

Motivation: Most recent networks for MRI synthesis adopt an encoder-decoder structure [7, 14, 32, 38], which recovers high-resolution features mainly from the low-resolution representation produced by successive convolutional blocks in the encoder. This latent representation contains only high-level features and loses much of the detailed information, leaving the recovered images neither semantically strong nor spatially precise. Inspired by the fact that the low- and high-resolution branches of multi-scale networks are capable of capturing global structure and preserving local details respectively [8, 22, 28], we design a generator based on a multi-scale structure comprising three modules: the horizontal module connects different-resolution inputs in parallel without any fusion, the extension module adds a downsampling block to extend a lower-resolution scale, and the composite module fuses cross-resolution representations to exchange information. An overview of our AutoGAN-Synthesizer is shown in Fig. 1. Specifically, the framework contains an adaptive generator constructed by neural architecture search according to the input modalities and a typical discriminator to distinguish between predictions and ground truths.

2.1 NAS-Based Generator Search

Generator Search Space: How to extract and fuse the features of multiple modalities in a multi-scale generator remains an open question. To address it, we propose three different modules (Fig. 1(b)) that provide guidelines for extracting and merging multi-resolution features: the horizontal module, the extension module and the composite module. These three modules behave differently to mimic the coarse-to-fine framework and exploit multiple possibilities of multi-scale fusion. Specifically, the horizontal module horizontally connects features via convolution blocks without feature fusion across scales. As shown in Fig. 1(b), the feature resolution at the same scale remains identical but is halved when the scale goes deeper. The extension module extends a lower-resolution scale via a down-sampling block. This connection helps to exploit the high-level priors extracted at the low-resolution scale while keeping the resolution unchanged at the high-resolution scale. The composite module merges multi-resolution features by skip connection, strided convolution and an up-sampling block, which can be summarized as:

$$\begin{aligned} F_g = \sum _{r}\mathcal {M}_{r\rightarrow g}(F_r) \end{aligned}$$
(1)

where r is the resolution of the input feature maps and g is the resolution of the output features. \(F_r\) represents the input feature maps at resolution r, and \(F_g\) denotes the output feature maps at resolution g after combining the features from all resolution scales. \(\mathcal {M}_{r\rightarrow g}(\cdot )\) is the mapping function defined as follows:

$$\begin{aligned} \mathcal {M}_{r\rightarrow g}(F_r) = \left\{ \begin{array}{ll} F_r &{} r = g\\ \text {Upsampled } F_r &{} r < g\\ \text {Downsampled } F_r &{} r > g \end{array} \right. \end{aligned}$$
(2)

Compared with the common fusion scheme [5, 26], which only fuses high-resolution features with the upsampled low-resolution features unidirectionally, this module aggregates multi-resolution representations in a bidirectional way. This powerful multi-resolution fusion scheme thus captures more spatial information from all resolution scales and is therefore semantically stronger.
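The following PyTorch snippet is a minimal sketch of the bidirectional fusion in Eqs. (1)-(2); it is illustrative only and uses bilinear interpolation for both resampling directions, whereas the actual composite module uses strided convolution and up-sampling blocks. Module and channel names are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiResolutionFusion(nn.Module):
    """Fuse feature maps from several resolution scales r into one output scale g."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # 1x1 convolutions align channel dimensions before the summation in Eq. (1).
        self.align = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, features, out_size):
        # features: list of tensors at different spatial resolutions r.
        fused = 0
        for feat, align in zip(features, self.align):
            feat = align(feat)
            if feat.shape[-2:] != out_size:  # r != g: resample per Eq. (2)
                feat = F.interpolate(feat, size=out_size,
                                     mode="bilinear", align_corners=False)
            fused = fused + feat             # summation over scales, Eq. (1)
        return fused


# Usage: fuse 1/1, 1/2 and 1/4 resolution branches into the full-resolution output.
if __name__ == "__main__":
    f1 = torch.randn(1, 32, 240, 240)
    f2 = torch.randn(1, 64, 120, 120)
    f3 = torch.randn(1, 128, 60, 60)
    fusion = MultiResolutionFusion([32, 64, 128], out_channels=32)
    out = fusion([f1, f2, f3], out_size=(240, 240))
    print(out.shape)  # torch.Size([1, 32, 240, 240])
```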

The combination of the three modules constructs a superior neural architecture that gives guidance on how to extract and merge features according to the requirements of different input modalities. An example of an optimized generator can be found in Fig. 1(c). The input modalities are fed into a super-network with two fixed horizontal modules, S modules selected from the horizontal, extension and composite module candidates, and a final composite module followed by a 1\(\times \)1 convolutional layer; a sketch of this module selection is given below. During the searching process, the progressive structure gradually adds multi-resolution modules and endows the output with multi-resolution knowledge.
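The sketch below shows one common way to realize gradient-based selection over the three module candidates, assuming a DARTS-style continuous relaxation with a softmax over architecture parameters and shape-compatible candidate outputs; the paper does not spell out this mechanism, so the class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SearchableModule(nn.Module):
    """One searchable slot mixing the three candidates (horizontal, extension, composite)."""

    def __init__(self, candidates):
        super().__init__()
        # candidates: list of nn.Modules assumed to produce outputs of identical shape.
        self.candidates = nn.ModuleList(candidates)
        # Architecture parameters optimized by gradient descent during the search stage.
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # Continuous relaxation: weighted mixture of candidate outputs during search.
        return sum(w * m(x) for w, m in zip(weights, self.candidates))

    def derive(self):
        # After search, keep only the strongest candidate for the final generator.
        return self.candidates[int(self.alpha.argmax())]
```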

2.2 GAN-Based Perceptual Loss Function

In order to recover a realistic and faithful image, we include both perceptual and pixel-level losses in our generator loss function:

$$\begin{aligned} \mathcal {L}_{Generator} =\mathcal {L}_{content}+\lambda _{adv}\mathcal {L}_{adv}+ \lambda _{complexity}\mathcal {L}_{complexity}, \end{aligned}$$
(3)

where \(\mathcal {L}_{content}\) is the content loss, consisting of a pixel-level loss (mean squared error) and a texture-level loss (perceptual loss) between the ground-truth and reconstructed images. \(\mathcal {L}_{adv}\) is the adversarial loss, based on a binary cross-entropy formulation, that pushes the reconstructed image closer to the ground truth. \(\mathcal {L}_{complexity}\) is the loss term accounting for model complexity (e.g., FLOPs, inference time, and model size).
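A hedged sketch of Eq. (3) is given below. The specific VGG feature layer, the use of `BCEWithLogitsLoss` for the adversarial term, the parameter-count complexity proxy, and the weighting values are all illustrative assumptions rather than the authors' exact choices.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19


class GeneratorSearchLoss(nn.Module):
    def __init__(self, lambda_adv=1e-3, lambda_complexity=1e-7):
        super().__init__()
        self.mse = nn.MSELoss()
        self.bce = nn.BCEWithLogitsLoss()
        # Frozen VGG-19 features as a perceptual (texture-level) loss.
        self.vgg = vgg19(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.lambda_adv = lambda_adv
        self.lambda_complexity = lambda_complexity

    def forward(self, generator, fake, real, disc_logits_fake):
        # Content loss: pixel-level MSE + perceptual loss on VGG features
        # (single-channel MRI slices are repeated to 3 channels for VGG).
        fake3, real3 = fake.repeat(1, 3, 1, 1), real.repeat(1, 3, 1, 1)
        content = self.mse(fake, real) + self.mse(self.vgg(fake3), self.vgg(real3))
        # Adversarial loss: push the discriminator's logits on fakes toward "real".
        adv = self.bce(disc_logits_fake, torch.ones_like(disc_logits_fake))
        # Complexity term: here simply the generator parameter count; in a
        # differentiable search this would be an expectation over the
        # architecture weights (FLOPs or latency could be used instead).
        complexity = sum(p.numel() for p in generator.parameters()) * 1.0
        return content + self.lambda_adv * adv + self.lambda_complexity * complexity
```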

2.3 K-space Learning

K-space is the spatial-frequency representation of MRI images. Due to the long scan time needed to acquire MRI images, several MRI reconstruction methods based on under-sampled K-space learning have been proposed for fast MRI [1, 9, 14]. Inspired by this, we embed K-space learning into our pipeline to introduce frequency priors of MRI images, which is defined as follows:

$$\begin{aligned} \hat{x}(k) = \mathcal {F}[x]\{k\} = \int _{\mathbb {R}^2}e^{-jk\cdot r}x(r)\,dr, \end{aligned}$$
(4)

where \(k\in \mathbb {R}^2\) is the spatial frequency and \(j^2=-1\). x(r) is the pixel intensity in real space, while \(\hat{x}(k)\) is the computed intensity in the frequency domain. The K-space representation is computed for each input modality and is fed to the network together with the real-space MRI images.
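In practice Eq. (4) is approximated by a discrete 2-D FFT. The sketch below shows one possible way to compute K-space features and stack them with the image as extra input channels; the log-magnitude/phase encoding and channel layout are our own assumptions.

```python
import torch


def to_kspace_channels(image: torch.Tensor) -> torch.Tensor:
    """image: (B, 1, H, W) real-valued MRI slice -> (B, 3, H, W) image + K-space features."""
    # Centered 2-D discrete Fourier transform approximating Eq. (4).
    kspace = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    # Log-magnitude compresses the large dynamic range of low-frequency components;
    # the phase carries the remaining information of the complex spectrum.
    log_mag = torch.log1p(kspace.abs())
    phase = torch.angle(kspace)
    return torch.cat([image, log_mag, phase], dim=1)


# Usage:
x = torch.randn(2, 1, 240, 240)
x_with_k = to_kspace_channels(x)
print(x_with_k.shape)  # torch.Size([2, 3, 240, 240])
```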

2.4 Implementation Details

Searching Setting: For each input modality combination, we search for a new architecture that gives guidance on how to extract and fuse the multi-modality features. First, we run a warm-up stage (ten epochs) to obtain desirable weights for the convolution layers, and then a searching stage of 200 epochs to optimize the architecture. For updating the model weights, we adopt the standard SGD optimizer with a momentum of 0.9 and a learning rate decayed from 0.025 to 0.001 by a cosine annealing strategy [20]. To optimize the architecture parameters, the Adam optimizer [16] is used with a learning rate of 0.0005; a sketch of this setup is given below. The batch size is 16, with images randomly cropped and padded to \(240\times 240\). Overall, the whole searching process takes 12 h.
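The following sketch mirrors the optimizer configuration described above (SGD with cosine annealing for the convolution weights, Adam for the architecture parameters); the way the two parameter groups are separated by name is an assumption about the implementation.

```python
import torch


def build_search_optimizers(supernet, search_epochs=200):
    # Assumed convention: architecture parameters are named "alpha" in the super-network.
    arch_params = [p for n, p in supernet.named_parameters() if "alpha" in n]
    weight_params = [p for n, p in supernet.named_parameters() if "alpha" not in n]

    # Convolution weights: SGD, momentum 0.9, lr decayed 0.025 -> 0.001 by cosine annealing [20].
    w_opt = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9)
    w_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        w_opt, T_max=search_epochs, eta_min=0.001
    )

    # Architecture parameters: Adam with learning rate 0.0005 [16].
    a_opt = torch.optim.Adam(arch_params, lr=0.0005)
    return w_opt, w_sched, a_opt
```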

Training Setting: After finding an architecture, we train it for 500 epochs with a batch size of 16 and an image size of \(240\times 240\). The Adam optimizer with a learning rate of 0.0005 is adopted. All experiments are implemented in PyTorch on a Tesla V100.

3 Experimental Results

3.1 Experimental Settings

We evaluate the performance of AutoGAN-Synthesizer on one-to-one and multiple-to-one cross-modality MRI synthesis tasks using two public brain MRI datasets: BRATS2018 and IXI. The BRATS2018 dataset [3, 4, 21] collects multi-modality MR image sets from patients with brain tumors in four different modalities: native T1-weighted (T1), contrast-enhanced T1-weighted (T1ce), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (FLAIR), where each scan has a size of 240 \(\times \) 240 \(\times \) 155. We conduct one-to-one and multiple-to-one synthesis tasks on the BRATS2018 dataset to show the effectiveness of our method. Following [7], we randomly select 50 of the 75 low-grade glioma (LGG) patients as the training set, while another 15 unseen patients are used as the test set. Following [7, 32, 34], we also use the public IXI dataset to verify model generalization. The IXI dataset collects multi-modality MR images from healthy subjects at three different hospitals. It is randomly divided into training (25 patients), validation (5 patients), and test (10 patients) sets. For each subject, after removing cases with major artifacts, approximately 100 axial cross sections containing brain tissue are manually selected.

Table 1. Quantitative results of FLAIR-T2 (BRATS2018 dataset) and T1-T2 (IXI dataset) MRI cross-modality synthesis tasks.

Evaluation Metrics: Following previous studies [32, 33], three metrics are used to evaluate quantitative performance: normalized root mean-squared error (NRMSE), peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [29].
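For reference, the three metrics can be computed per slice with scikit-image as sketched below; the NRMSE normalization and the `data_range` choice are assumptions, since the paper does not state them.

```python
import numpy as np
from skimage.metrics import (
    normalized_root_mse,
    peak_signal_noise_ratio,
    structural_similarity,
)


def evaluate_slice(gt: np.ndarray, pred: np.ndarray) -> dict:
    """gt, pred: 2-D arrays for a single ground-truth and synthesized slice."""
    data_range = gt.max() - gt.min()
    return {
        # NRMSE normalized by the Euclidean norm of gt (scikit-image default).
        "NRMSE": normalized_root_mse(gt, pred),
        "PSNR": peak_signal_noise_ratio(gt, pred, data_range=data_range),
        "SSIM": structural_similarity(gt, pred, data_range=data_range),
    }
```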

3.2 Comparisons with State-of-the-Art Methods

To verify the performance of our AutoGAN-Synthesizer, we compare it with five recent state-of-the-art methods: CycleGAN [39], Pix2pix [12], pGAN and cGAN [7], and Hi-Net [38]. To ensure a fair comparison, we train all networks on the same dataset using the authors' open-source implementations and recommended hyper-parameters.

Fig. 2. Qualitative results of FLAIR \(\xrightarrow []{}\)T2 synthesis experiments on glioma patients from the BRATS2018 dataset. Compared with other state-of-the-art results, our synthetic images recover favorable tissue contrast, tumor, and anatomy knowledge, which has great potential for clinical diagnosis and treatment.

Fig. 3. Visual comparison of difference maps for the synthetic T2 modality against other state-of-the-art methods on the BRATS2018 dataset.

One-to-One Cross-Modality MRI Synthesis Tasks: We focus on synthesizing T2 contrasts, which are complementary to T1 contrasts and offer better information for investigating fluid-filled structures within tissue. The experimental results for one-to-one synthesis tasks are listed in Table 1. For the FLAIR-T2 cross-modality synthesis task, Table 1 shows that our AutoGAN achieves better performance than other cutting-edge methods on all three metrics. Figures 2 and 3 show the qualitative comparison between our proposed AutoGAN and the five other state-of-the-art methods on the BRATS2018 dataset. The difference maps are generated from pixel intensities and visualized as heat maps. Our synthetic images have clearer details in the zoomed rectangles and also preserve favorable tissue contrast, tumor, and anatomy knowledge, which has great potential for clinical diagnosis and treatment. Overall, our method reaches higher fidelity to the target images and searches for synthesis networks that outperform manually designed architectures. The superiority of AutoGAN is mainly attributed to our module-based search space, which effectively exploits the information fusion between low- and high-resolution features. As shown in Table 1, the quantitative results on the IXI dataset also imply that our AutoGAN achieves better generalization than other methods.

Model Complexity: As shown in Table 1, our AutoGAN achieves state-of-the-art performance using only a very light-weight network architecture with 6.30 Mb of parameters, nearly half that of other manually designed networks (around 11 Mb) [7, 39].

Multiple-to-One Cross-Modality MRI Synthesis Tasks: To verify the effectiveness of our method on multiple-to-one tasks, we conduct experiments with different combinations of input modalities on the BRATS2018 dataset (Fig. 4(b)). Compared with Hi-Net, which is specifically designed for two-modality input, our AutoGAN demonstrates considerable improvements, with PSNR rising from 24.95 dB (Hi-Net) to 27.12 dB (ours) on the FLAIR+T1\(\xrightarrow []{}\)T2 task. Figure 4(b) also verifies that our method can fuse multiple input modalities and provide promising performance. In addition, it illustrates that more input modality knowledge can further boost synthesis performance. Figure 4(a) shows qualitative results for different multiple-modality inputs on the models searched by our AutoGAN. The results of the FLAIR+T1+T1ce\(\xrightarrow []{}\)T2 task are visually much better than the other three configurations, which is consistent with the quantitative evaluation. It also verifies that different modalities contain partly complementary knowledge, which can boost synthesis performance.

Fig. 4. Multiple-to-one cross-modality MRI synthesis tasks on the BRATS2018 dataset: (a) qualitative comparison of difference maps; (b) quantitative results.

3.3 Ablation Study

Study of Each Component: We conduct an ablation study to demonstrate the effectiveness of each component, i.e., the perceptual and adversarial parts of our loss function and the MRI K-space learning strategy. Figure 5(a) lists the results of all configurations of these three components on the FLAIR \(\xrightarrow []{}\)T2 synthesis task on the BRATS2018 dataset. It indicates that the perceptual and adversarial losses further improve quantitative performance; after adding them, our algorithm recovers highly realistic images with better structural similarity and peak signal-to-noise ratio. Furthermore, embedding MRI K-space features in the network introduces additional information and therefore also boosts performance. Figure 6 shows the qualitative results of the ablation study. Adding each component successively yields better synthetic images. In Fig. 6, the FLAIR image has poor quality, making it challenging to synthesize a reasonable T2 image; however, with the help of the perceptual loss, adversarial loss, and K-space learning, the results are progressively improved and the missing part is gradually compensated.

Fig. 5. (a) Ablation study of our GAN-based loss and MRI K-space features on the BRATS2018 dataset (FLAIR \(\xrightarrow []{}\)T2). (b) Comparison of our search strategy with a random policy. Our AutoGAN can search light-weight networks with better performance.

Fig. 6. Visualization results of our ablation study, showing the effectiveness of the three components in our pipeline: perceptual loss, adversarial loss and K-space learning. The baseline represents the network without the three components, \(+\)perceptual means the baseline with only the perceptual loss, \(+\)adversarial denotes the baseline with perceptual and adversarial losses, and \(+\)Kspace represents our complete method with perceptual loss, adversarial loss and K-space learning.

Effectiveness of Our Search Strategy: To verify the effectiveness of the search strategy in AutoGAN, we compare it with a random policy that samples 20 models from our search space. As shown in Fig. 5(b), compared with the random policy, our AutoGAN searches superior networks with smaller model size and better performance. More specifically, the networks from the random policy span a wide range of model sizes, from 6 Mb to around 12.5 Mb, whereas the search strategy of AutoGAN constrains the model size within a much smaller interval, greatly reducing both the lower and upper bounds without sacrificing performance. This makes it easier to deploy AI models in a variety of resource-constrained clinical scenarios.

4 Conclusion

We propose AutoGAN-Synthesizer to automatically design a generative network that learns how to extract and fuse features according to different input modalities for cross-modality MRI synthesis. A novel GAN-based perceptual searching loss incorporating specialized MRI K-space features is proposed to recover highly realistic images and to balance the trade-off between model complexity and performance. The proposed method outperforms other manually designed state-of-the-art synthesis algorithms and restores faithful tumor and anatomy information.