1 Introduction

Multimodal MRI scans are widely used to identify brain tumors, and brain tumor segmentation is an important step toward diagnosis. The major goal of medical image segmentation is to extract areas of interest in an image, such as tumor regions, allowing doctors to focus on the most important areas for diagnosis or monitoring [7, 11]. Manual tumor segmentation from MRI scans by physicians has long been a time-consuming task. To speed up the segmentation process, many automated methods have been developed [2,3,4,5, 7, 9, 10]. With the rapid development of deep learning in computer vision, deep-learning based automated brain tumor segmentation has become increasingly popular [14]. In particular, convolutional neural networks (CNNs) have shown great success in image segmentation tasks [12], with the U-Net [13] being the most prominent example.

The Multimodal Brain Tumor Segmentation Challenge (BraTS) is an annual challenge that gathers state-of-the-art methods for the segmentation of brain tumors. Participants are provided with clinically acquired training data to develop their own models and produce segmentation labels of three glioma sub-regions: enhancing tumor (ET), tumor core (TC), and whole tumor (WT) [2,3,4,5, 9, 10]. The BraTS 2020 training dataset contains 369 cases, including 293 high-grade glioma (HGG) and 76 low-grade glioma (LGG) cases, each with four 3D MRI modalities: the native T1-weighted (T1), post-contrast T1-weighted (T1CE), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (FLAIR) images. Example images of the four modalities are shown in Fig. 1. The validation dataset contains 125 cases and the test dataset contains 166 cases. Each MRI volume has dimensions 240 \(\times \) 240 \(\times \) 155 voxels.

Fig. 1.

Visualization of the four modalities in the BraTS 2020 training dataset: (a) T1, (b) T1CE, (c) T2, (d) FLAIR, and (e) the segmentation overlaid on T1. These images are from the same patient. In (e), the blue area represents the necrotic and non-enhancing tumor (NCR/NET), the orange area represents the peritumoral edema (ED), and the white area represents the GD-enhancing tumor (ET). (Color figure online)

In this paper, we propose a model that combines the variational-autoencoder (VAE) regularized 3D U-Net [10] with the MultiResUNet [7] and train it end-to-end on the BraTS 2020 training dataset. Our model follows the encoder-decoder structure of the 3D U-Net of [10] used in the BraTS 2018 Segmentation Challenge, but replaces the ResNet-like blocks with “MultiRes blocks” and connects the feature maps from the encoder stages to the decoder stages through “Res paths”, chains of convolutional layers with residual connections [7].

2 Related Work

2.1 VAE-regularized 3D U-Net

In BraTS 2018, the first-place winner Myronenko [10] proposed an encoder-decoder CNN with an asymmetrically large encoder to extract deep image features and a VAE branch to regularize the encoder. In BraTS 2019, the winning team of Jiang et al. [8] proposed a two-stage cascaded U-Net. The first stage is a variant of the asymmetric U-Net in [10]; the second stage doubles the number of filters in the initial 3D convolution to increase the network width and has two decoders, one using deconvolutions and the other using trilinear interpolation, which helps regularize the encoder. The top rankings of these two models demonstrate the power of VAE-regularized 3D U-Nets in MRI brain tumor segmentation.

2.2 MultiResUNet

In early 2020, Ibtehaz and Rahman [7] proposed a modified U-Net, called MultiResUNet, which outperforms the classical U-Net. The architecture has two key components: the MultiRes block and the Res path. In the MultiRes block, there are three consecutive 3 \(\times \) 3 convolutional layers with gradually increasing numbers of filters. A residual connection with a 1 \(\times \) 1 convolutional layer is added in each MultiRes block to gather more spatial information. The Res path passes the feature maps from the encoder stage through a chain of 3 \(\times \) 3 convolutions with 1 \(\times \) 1 convolutional residual connections and then concatenates them with the decoder features. The MultiResUNet can also be applied to 3D images. The structures of the MultiRes block and the Res path are shown in Figs. 2 and 3, respectively.

Fig. 2.

The structure of the MultiRes block: the green boxes represent convolutional layers. The three 3 \(\times \) 3 convolutional layers have increasing numbers of filters, and their outputs are concatenated, which helps the model capture information at different scales. The result of the 1 \(\times \) 1 convolutional layer applied to the block input is then added to the concatenated output of the three 3 \(\times \) 3 convolutional layers to obtain the final output of the MultiRes block. (Color figure online)

Fig. 3.

The structure of the Res path: the green boxes represent convolutional layers. In many classic U-Nets, the encoder feature maps are simply passed to the decoder. However, there may be a semantic gap between the decoder features and the encoder features, since the decoder features have undergone more processing. Therefore, Ibtehaz and Rahman [7] inserted the Res path between the encoder and decoder, before the encoder features are concatenated with the decoder features. The Res path is a series of 3 \(\times \) 3 convolutional layers with 1 \(\times \) 1 convolutional residual connections. (Color figure online)

In this paper, we modify the VAE-regularized 3D U-Net [10] by replacing the initial 3 \(\times \) 3 \(\times \) 3 convolution block with a revised 3D MultiRes block based on the MultiRes block in [7] and by adding Res paths between the encoder and decoder stages.

3 Method

The architecture of our proposed network is shown in Fig. 4. Due to limited computational resources, each image is cropped from 240 \(\times \) 240 \(\times \) 155 voxels to 160 \(\times \) 192 \(\times \) 128 and then resized to 80 \(\times \) 96 \(\times \) 64. Each MRI modality is fed into the network as one channel, so the input has four channels in total.

Fig. 4.

The architecture of our MultiResUNet with a VAE branch. The size of the original input is 4 \(\times \) 80 \(\times \) 96 \(\times \) 64. Each MultiRes block has three sequential 3 \(\times \) 3 \(\times \) 3 convolution layers and a 1 \(\times \) 1 \(\times \) 1 convolutional residual connection. Before each convolution layer, we apply group normalization and the ReLU activation function. In the Res path, there are two 3 \(\times \) 3 \(\times \) 3 convolutional layers, each with one 1 \(\times \) 1 \(\times \) 1 convolutional residual connection. The output segmentation has three channels, each of size 80 \(\times \) 96 \(\times \) 64. Finally, we use the sigmoid activation function to obtain the segmentation for the three regions: WT (TC + ED), TC (ET + NCR/NET) and ET.
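To make the nested-region encoding concrete, the snippet below sketches one possible way to convert the three sigmoid probability maps (WT, TC, ET) into a single label map. The 0.5 threshold and the label values (1 = NCR/NET, 2 = ED, 4 = ET) follow the standard BraTS convention; the exact post-processing is not detailed here, so this should be read as an illustrative sketch rather than the exact pipeline.

```python
import numpy as np


def probs_to_labels(p_wt, p_tc, p_et, threshold=0.5):
    """Map the three overlapping region probabilities (WT, TC, ET) to BraTS
    labels: 2 = ED, 1 = NCR/NET, 4 = ET. Voxels are assigned from the largest
    region (WT) to the smallest (ET), so inner labels overwrite outer ones."""
    labels = np.zeros(p_wt.shape, dtype=np.uint8)
    labels[p_wt > threshold] = 2   # whole tumor -> edema by default
    labels[p_tc > threshold] = 1   # tumor core  -> NCR/NET
    labels[p_et > threshold] = 4   # enhancing tumor
    return labels
```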

3.1 Encoder

The encoder part has multiple MultiRes blocks, each of which contains three sequential 3 \(\times \) 3 \(\times \) 3 convolution layers with a 1 \(\times \) 1 \(\times \) 1 convolutional residual connection to gather spatial features at different scales. After each MultiRes block, we use a strided convolution to halve the spatial dimensions and double the feature size. The numbers of filters in the three 3 \(\times \) 3 \(\times \) 3 convolution layers are 6, 12 and 18, respectively. Each MultiRes block applies group normalization and the ReLU activation function before each convolution. The final output layer of the network uses the sigmoid activation function.
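As a minimal tf.keras sketch (assuming a channels-last layout, a GroupNormalization layer as available in recent tf.keras or tensorflow_addons, and an illustrative group count of 2 so that the channel counts divide evenly), one encoder stage could be written as follows; the doubling rule at deeper stages and all names are illustrative choices rather than the exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers


def gn_relu_conv3d(x, n_filters, kernel_size, groups=2):
    # Pre-activation: group normalization and ReLU before each convolution.
    # The paper uses group size 8; groups=2 is chosen here only so that the
    # illustrative channel counts (4, 6, 12, 18) divide evenly.
    x = layers.GroupNormalization(groups=groups)(x)
    x = layers.ReLU()(x)
    return layers.Conv3D(n_filters, kernel_size, padding="same")(x)


def multires_block_3d(x, filters=(6, 12, 18)):
    # Three consecutive 3x3x3 convolutions with 6, 12 and 18 filters,
    # concatenated, plus a 1x1x1 convolutional residual connection.
    c1 = gn_relu_conv3d(x, filters[0], 3)
    c2 = gn_relu_conv3d(c1, filters[1], 3)
    c3 = gn_relu_conv3d(c2, filters[2], 3)
    concat = layers.Concatenate()([c1, c2, c3])
    shortcut = layers.Conv3D(sum(filters), 1, padding="same")(x)
    return layers.Add()([concat, shortcut])


def encoder_stage(x):
    # One encoder stage: a MultiRes block, then a stride-2 convolution that
    # halves the spatial dimensions and doubles the channel width.
    skip = multires_block_3d(x)
    down = layers.Conv3D(2 * skip.shape[-1], 3, strides=2, padding="same")(skip)
    return skip, down  # `skip` feeds the Res path, `down` feeds the next stage


inputs = tf.keras.Input(shape=(80, 96, 64, 4))  # channels-last layout assumed
skip1, x = encoder_stage(inputs)
```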

3.2 Decoder with a VAE Branch

The decoder part also has multiple MultiRes blocks. We use strided convolutions to double the spatial dimensions and halve the feature size at each stage. After the decoder, the output has three channels, and each channel has the same spatial size as the input.

For the VAE branch, we adopt a structure similar to the VAE used in [10]. The output of the fourth MultiRes block (MultiRes Block 4 in Fig. 4) is reduced to a 256 \(\times \) 1 vector, which is interpreted as a 128-dimensional mean and a 128-dimensional standard deviation; we then draw a sample from the corresponding Gaussian distribution. Then, without any Res path connections to the encoder, we use the same structure as the decoder to return to the original dimension 4 \(\times \) 80 \(\times \) 96 \(\times \) 64. This branch regularizes the encoder during training and helps prevent overfitting.
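The sampling step can be sketched with the standard reparameterization trick. Everything below, including the split of the 256-dimensional vector, the log-variance parameterization (an equivalent alternative to predicting the standard deviation directly), and the size of the reshaped seed feature map, is an illustrative assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers


class Sampling(layers.Layer):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps


# Assumed wiring: the 256-d bottleneck is read as a 128-d mean and a 128-d
# log-variance; the sample is then projected and reshaped into a low-resolution
# feature map for the VAE decoder branch. The 10 x 12 x 8 seed size is only an
# illustrative value (one eighth of 80 x 96 x 64 in each dimension).
bottleneck = tf.keras.Input(shape=(256,))
z_mean = layers.Lambda(lambda t: t[:, :128])(bottleneck)
z_log_var = layers.Lambda(lambda t: t[:, 128:])(bottleneck)
z = Sampling()([z_mean, z_log_var])
seed = layers.Dense(10 * 12 * 8)(z)
seed = layers.Reshape((10, 12, 8, 1))(seed)  # later upsampled back to 80 x 96 x 64
```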

The details of the VAE branch are shown in Table 1.

Table 1. The VAE part. Conv3 denotes a 3 \(\times \) 3 \(\times \) 3 convolutional layer and Conv1 a 1 \(\times \) 1 \(\times \) 1 convolutional layer. GN is group normalization with group size 8. Conv1 RC is the 1 \(\times \) 1 \(\times \) 1 convolutional residual connection in the MultiRes block. AddId is the addition of the residual connection. Concate is the concatenation of the three consecutive Conv3 outputs. Steps 5, 7 and 9 are MultiRes Blocks 9, 10 and 11 in Fig. 4.

3.3 Res Path

In our network, the original skip connections between the encoder and decoder are replaced by Res paths. Within each Res path, there is a chain of two 3 \(\times \) 3 \(\times \) 3 convolutional layers, and each convolutional layer has one 1 \(\times \) 1 \(\times \) 1 convolutional residual connection. The feature maps from the encoder are passed into the Res path and the output is concatenated with the decoder features. While in the original MultiResUNet the number of filters in the Res path changes as the image is downsized, all convolutions in our Res paths have 32 filters.
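A minimal sketch of this Res path (the activation placement and layer names are assumptions) might look as follows:

```python
import tensorflow as tf
from tensorflow.keras import layers


def res_path_3d(x, n_filters=32, length=2):
    # A chain of 3x3x3 convolutions, each with a 1x1x1 convolutional residual
    # connection, applied to encoder features before they are concatenated
    # with the decoder features.
    for _ in range(length):
        conv = layers.Conv3D(n_filters, 3, padding="same", activation="relu")(x)
        shortcut = layers.Conv3D(n_filters, 1, padding="same")(x)
        x = layers.Add()([conv, shortcut])
    return x


# Usage at a decoder stage (illustrative shapes): bridge the encoder skip
# features, then concatenate them with the upsampled decoder features.
skip = tf.keras.Input(shape=(80, 96, 64, 36))
upsampled = tf.keras.Input(shape=(80, 96, 64, 36))
merged = layers.Concatenate()([upsampled, res_path_3d(skip)])
```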

3.4 Loss Function

We use a combined loss function with three components: the soft Dice loss (\(L_{dice}\)), the L2 loss on the VAE part (\(L_{L2}\)), and the VAE penalty term (\(L_{KL}\)), the KL divergence between the estimated normal distribution \(N(\mu , \sigma ^2)\) and the prior distribution N(0, 1), as used in [10].

The equation of the soft dice loss is

$$\begin{aligned} L_{dice}=\frac{2\sum S \cdot R}{\sum S^2+\sum R^2 +\epsilon } \end{aligned}$$
(1)

where S represents the true labels, R represents the output predicted by the model, and \(\epsilon \) is a small constant added for numerical stability.

The equation of \(L_{L2}\) is

$$\begin{aligned} L_{L2}=||I_{origin}-I_{VAE_{pred}}|| \end{aligned}$$
(2)

where \(I_{origin}\) is the original input image and \(I_{VAE_{pred}}\) is the predicted image from the VAE part.

The equation of \(L_{KL}\) is

$$\begin{aligned} L_{KL}=\frac{1}{N}\sum \left( \mu ^2+\sigma ^2-\log \sigma ^2-1\right) \end{aligned}$$
(3)

where N is the number of voxels in the image, and \(\mu \) and \(\sigma \) are the parameters of the estimated normal distribution.
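Putting the three terms together, a minimal implementation could look like the following sketch; the sign convention for the Dice term, the mean-squared-error form of the L2 term, and the 0.1 weights (taken from [10]) are assumptions rather than the exact settings used here.

```python
import tensorflow as tf


def soft_dice_loss(y_true, y_pred, eps=1e-8):
    # Eq. (1): 2*sum(S*R) / (sum(S^2) + sum(R^2) + eps), summed over all voxels.
    # The value is negated so that minimizing the loss maximizes the Dice term.
    num = 2.0 * tf.reduce_sum(y_true * y_pred)
    den = tf.reduce_sum(tf.square(y_true)) + tf.reduce_sum(tf.square(y_pred)) + eps
    return -num / den


def l2_loss(x_orig, x_vae_pred):
    # Eq. (2): reconstruction error of the VAE branch (mean squared error here).
    return tf.reduce_mean(tf.square(x_orig - x_vae_pred))


def kl_loss(mu, sigma, n_voxels):
    # Eq. (3): KL divergence between N(mu, sigma^2) and N(0, 1), scaled by 1/N.
    return tf.reduce_sum(tf.square(mu) + tf.square(sigma)
                         - tf.math.log(tf.square(sigma)) - 1.0) / n_voxels


def total_loss(y_true, y_pred, x_orig, x_vae_pred, mu, sigma, n_voxels,
               w_l2=0.1, w_kl=0.1):
    # Weighted sum of the three terms; the 0.1 weights follow [10].
    return (soft_dice_loss(y_true, y_pred)
            + w_l2 * l2_loss(x_orig, x_vae_pred)
            + w_kl * kl_loss(mu, sigma, n_voxels))
```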

3.5 Optimization

We use the Adam optimizer with an initial learning rate of \({\alpha _0}=10^{-4}\) and decay the learning rate according to

$$\begin{aligned} \alpha =\alpha _0\left( 1-\frac{e}{N_e}\right) ^{0.9}, \end{aligned}$$
(4)

where e is the epoch counter and \(N_e\) is the total number of epochs. The total number of epochs is set to 300 and the batch size is 1.
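Eq. (4) is a polynomial (“poly”) learning-rate decay. A sketch of one way to wire it up in Keras is shown below; the callback-based scheduling is an assumption, not necessarily how it was implemented here.

```python
import tensorflow as tf


def poly_lr(epoch, alpha0=1e-4, total_epochs=300, power=0.9):
    # Eq. (4): alpha = alpha0 * (1 - e / N_e) ** 0.9
    return alpha0 * (1.0 - epoch / total_epochs) ** power


# Example wiring with the Adam optimizer and a per-epoch scheduler callback.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
lr_callback = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: poly_lr(epoch))
# model.fit(..., epochs=300, batch_size=1, callbacks=[lr_callback])
```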

4 Experiment

The experiments are implemented in Python [15]. In particular, the Keras library [6] with TensorFlow [1] as the backend is used to build the network model. The model is trained on the BraTS 2020 training dataset (369 cases in total) without additional in-house data.

The training dataset contains 369 cases, of which 293 are high-grade glioma (HGG) and 76 are low-grade glioma (LGG) cases. The validation dataset contains 125 cases. The validation results are submitted to CBICA’s Image Processing Portal (https://ipp.cbica.upenn.edu) for evaluation.

4.1 Data Preprocessing and Augmentation

All the original images are preprocessed to have zero mean and unit standard deviation. Each image is randomly cropped to the size of 160 \(\times \) 192 \(\times \) 128. Due to computational limitations, we then resize the images from 160 \(\times \) 192 \(\times \) 128 to 80 \(\times \) 96 \(\times \) 64. Every input image is randomly flipped along a random axis.
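A sketch of this preprocessing and augmentation for a single modality volume is given below; the crop offsets, the zoom-based resizing and the per-volume normalization are illustrative choices, since the exact implementation details are not spelled out here.

```python
import numpy as np
from scipy.ndimage import zoom


def preprocess(volume, crop_size=(160, 192, 128), out_size=(80, 96, 64)):
    # Normalize to zero mean and unit standard deviation (whole volume here).
    volume = (volume - volume.mean()) / (volume.std() + 1e-8)
    # Randomly crop to 160 x 192 x 128 (offsets drawn uniformly).
    starts = [np.random.randint(0, s - c + 1)
              for s, c in zip(volume.shape, crop_size)]
    volume = volume[starts[0]:starts[0] + crop_size[0],
                    starts[1]:starts[1] + crop_size[1],
                    starts[2]:starts[2] + crop_size[2]]
    # Resize to 80 x 96 x 64 with linear interpolation.
    factors = [o / c for o, c in zip(out_size, crop_size)]
    return zoom(volume, factors, order=1)


def random_flip(volume):
    # Flip along a randomly chosen axis.
    axis = np.random.randint(0, 3)
    return np.flip(volume, axis=axis)
```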

To improve robustness, we repeat the random flipping 10 times for each case and use the flipped data to generate 10 training datasets. We train one model on each of these datasets and average the outputs of the 10 models to obtain the final segmentation.
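The ensembling step amounts to averaging the sigmoid probability maps of the individually trained models before thresholding; a minimal sketch, assuming `models` is a list of the 10 trained Keras models:

```python
import numpy as np


def ensemble_predict(models, volume):
    # Average the sigmoid probability maps predicted by the individually
    # trained models; the averaged maps can then be thresholded and mapped to
    # BraTS labels (see the sketch after Fig. 4).
    probs = [m.predict(volume[np.newaxis, ...])[0] for m in models]
    return np.mean(probs, axis=0)
```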

4.2 Results

Our model is trained on NVIDIA Tesla V100 16 GB GPUs. On a single GPU, training for 300 epochs takes approximately 2.5 days. The best result of a single model on the validation dataset is provided in Table 2. The best final result on the validation dataset, obtained by averaging the outputs of the 10 models trained on the 10 training datasets, is reported in Table 3. The model ensemble improves our results by 0.4%. An example of our predicted segmentation is shown in Fig. 5.

Table 2. BraTS 2020 validation dataset result of the best single model.
Table 3. BraTS 2020 validation dataset result of the best ensemble model.
Fig. 5.

An example of our predicted segmentation. The blue area represents the necrotic and non-enhancing tumor (NCR/NET), the orange area represents the peritumoral edema (ED), and the white area represents the GD-enhancing tumor (ET). (Color figure online)

5 Discussion and Conclusion

In this paper, we modified a VAE-regularized 3D U-Net into the proposed MultiResUNet-style architecture by replacing the ResNet-like blocks with MultiRes blocks and adding Res paths between the encoder and decoder to reduce the possible semantic gap. The MultiRes block extracts information at different scales of the image. We evaluated our method on the BraTS 2020 training and validation datasets, and the results show that the architecture achieves satisfactory performance. We also tried to train the 3D U-Net model proposed by the BraTS 2018 winner Myronenko [10], but were constrained by limited computational resources, since that model was trained on an NVIDIA Volta V100 32 GB GPU. We had to crop the original images from 240 \(\times \) 240 \(\times \) 155 to 112 \(\times \) 126 \(\times \) 96, which may have substantially affected that model’s performance. As the amount of data increases, that model would place even higher demands on GPUs. In comparison, our model achieves results very close to those of that model while requiring fewer computational resources, which makes it more economically attractive.