Abstract
Automatic MRI brain tumor segmentation is of vital importance for disease diagnosis, monitoring, and treatment planning. In this paper, we propose a two-stage encoder-decoder based model for brain tumor subregional segmentation. Variational autoencoder regularization is utilized in both stages to mitigate overfitting. The second-stage network adopts attention gates and is additionally trained on an expanded dataset formed from the first-stage outputs. On the BraTS 2020 validation dataset, the proposed method achieves mean Dice scores of 0.9041, 0.8350, and 0.7958, and Hausdorff distances (95%) of 4.953, 6.299, and 23.608 for the whole tumor, tumor core, and enhancing tumor, respectively. The corresponding results on the BraTS 2020 testing dataset are 0.8729, 0.8357, and 0.8205 for Dice score, and 11.4288, 19.9690, and 15.6711 for Hausdorff distance. The code is publicly available at https://github.com/shu-hai/two-stage-VAE-Attention-gate-BraTS2020.
1 Introduction
Brain tumors can be categorized into primary tumors and secondary tumors depending on where they originate. Glioma, the most common type of primary brain tumor, can be further categorized into low-grade gliomas (LGG) and high-grade gliomas (HGG). HGG is a malignant brain tumor type with a high degree of aggressiveness that often requires surgery. Usually, several complementary 3D Magnetic Resonance Imaging (MRI) modalities are acquired to highlight different tissue properties and areas of tumor spread. Compared to traditional methods that rely on physicians’ professional knowledge and experience, automatic 3D brain tumor segmentation is time-efficient and can provide objective and reproducible results for further tumor analysis and monitoring. In recent years, deep-learning based segmentation approaches have exhibited performance superior to that of traditional methods.
The Multimodal Brain Tumor Segmentation Challenge (BraTS) is an annual international competition that aims to evaluate state-of-the-art methods of brain tumor segmentation [1,2,3, 13]. The organizer provides a 3D multimodal MRI dataset with “ground-truth” tumor segmentation labels annotated by physicians and radiologists. For each patient, four 3D MRI modalities are provided including native T1-weighted (T1), post-contrast T1-weighted (T1c), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (T2-FLAIR) volumes. The brain tumor segmentation task concentrates on three tumor sub-regions: the necrotic and non-enhancing tumor (NCR/NET, labeled 1), the peritumoral edema (ED, labeled 2) and the GD-enhancing tumor (ET, labeled 4). Figure 1 shows an image set of a patient. The rankings of competing methods for this segmentation task are determined by metrics, including Dice score, Hausdorff distance (95%), Sensitivity, and Specificity, evaluated on the testing dataset for ET, tumor core (TC = ET + NCR/NET), and whole tumor (WT = TC + ED) [4].
In BraTS 2018, Myronenko [14] proposed an asymmetrical U-Net with a larger encoder for feature extraction and a smaller decoder for label reconstruction, and won first place in the challenge. An encouraging innovation of the method is utilizing a variational autoencoder (VAE) branch to regularize the encoder and boost generalization performance. The champion team of BraTS 2019, Jiang et al. [12], proposed a two-stage network, which used an asymmetrical U-Net, similar to Myronenko [14], in the first stage to obtain a coarse prediction, and then employed a similar but wider network in the second stage to refine the prediction. An additional branch was adopted in the decoder of the second-stage network to regularize the associated encoder. The success of these two models indicates the feasibility and importance of adding a branch to the decoder to reduce overfitting and boost model performance.
Compared with general computer vision problems, 3D MRI image segmentation tasks generally face two special challenges: the scarcity of training data and class imbalance [20]. To alleviate the shortage of training data, Isensee et al. [11] took advantage of additional labeled data by using a co-training strategy. Zhou et al. [23] combined several performance-boosting tricks, such as introducing a focal loss to alleviate the class imbalance, to achieve further improvements.
For brain tumor segmentation tasks specifically, another major difficulty is the variability of tumor morphology and location across different tumor development stages and different cases. To improve prediction accuracy, many segmentation methods [16, 18, 22, 23] decompose the task into a localization step followed by a segmentation step, with additional preceding models for object localization. For instance, Wang et al. [18] sequentially trained three networks according to the tumor subregion hierarchy. Oktay et al. [15] demonstrated that the same objective can be achieved by introducing attention gates (AGs) into the standard convolutional-neural-network framework for pancreas tumor segmentation.
Inspired by the aforementioned works, in this paper we propose a two-stage cascade network for brain tumor segmentation. We borrow the network structure of Myronenko [14] as the first-stage network to obtain relatively rough segmentation results. The second-stage network uses the concatenation of the preliminary segmentation maps from the first-stage network and the MRI images as input, with the aim of refining the prediction of the NCR/NET and ET subregions. We apply AGs [15] to further suppress feature responses in irrelevant background regions. Our second-stage network exhibits the capabilities to (i) provide more model candidates with competitive performance for model ensembling, (ii) stabilize the predictions across models of different epochs, and (iii) improve the performance of each single model, particularly for NCR/NET and ET. The implementation details and segmentation results are provided in Sects. 3 and 4.
2 Method
The proposed two-stage network structure consists of two cascaded networks. The first-stage network takes the multimodal MRI images as input and predicts coarse segmentation maps. The concatenation of the preliminary segmentation maps and the MRI images is passed into the second-stage network to generate improved segmentation results.
2.1 The First-Stage Network: Asymmetrical U-Net with a VAE Branch
The network architecture (Fig. 2) consists of a larger encoding path for semantic feature extraction, a smaller decoding path for segmentation map prediction, and a VAE branch for input image reconstruction. This part is identical to the network proposed in [14].
Encoder. The encoder consists of ResNet [7, 8] blocks at four spatial levels, with 1, 2, 2, and 4 blocks per level, respectively. Each ResNet block has two convolutions with Group Normalization and ReLU, followed by an additive identity skip connection. The encoder input is an MRI crop of size \(4\,{\times }\,160\,{\times }\,192\,{\times }\,128\), where the first dimension indexes the four MRI modalities. The input is processed by a \(3\,{\times }\,3\,{\times }\,3\) convolution layer with 32 filters and a dropout layer with a rate of 0.2, and then passed through the series of ResNet blocks. Between every two blocks at different spatial levels, a \(3\,{\times }\,3\,{\times }\,3\) convolution with a stride of 2 halves the resolution of the feature maps and simultaneously doubles the number of feature channels. The encoder endpoint has size \(256\,{\times }\,20\,{\times }\,24\,{\times }\,16\), i.e., 1/8 of the spatial size of the input data.
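As a minimal PyTorch sketch (not the authors' exact code), a ResNet block of the kind described above could be written as follows, assuming the pre-activation ordering (GroupNorm, ReLU, convolution) used in [14]:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two GroupNorm -> ReLU -> Conv3d units plus an additive identity skip.
    The group count of 8 is an illustrative assumption."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.GroupNorm(groups, channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # identity skip connection: output keeps the input's shape
        return x + self.body(x)
```

Because the block preserves the channel count and spatial size, stacking 1, 2, 2, and 4 of these blocks per level only requires the strided \(3\,{\times }\,3\,{\times }\,3\) convolutions between levels to change dimensions.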
Decoder. The decoder has an almost symmetrical architecture to the encoder, except that the number of ResNet blocks within each spatial level is 1. After each block, a trilinear up-sampler doubles the spatial size and a \(1 \,{\times }\,1\,{\times }\,1\) convolution halves the number of feature channels, followed by an additive skip connection from the encoder output of the corresponding spatial level. The operations within each block are the same as in the encoder. At the end of the decoder, a \(1\,{\times }\,1\,{\times }\,1\) convolution reduces the number of feature channels from 32 to 3, followed by a sigmoid function that converts feature maps into probability maps.
VAE Branch. This branch receives the output of the encoder and produces a reconstruction of the original input image. First, the encoder endpoint output is reduced to a 256-dimensional vector through a fully connected layer, where the 256 values represent 128 means and 128 standard deviations of Gaussian distributions, from which a sample of size 128 is drawn. The drawn vector is then mapped back to a high-dimensional feature space and gradually reconstructed into the input image dimensions following the same strategy as the decoder. Note that there is no additive skip connection between the encoder and the VAE branch.
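The sampling step of the VAE branch can be sketched with the standard reparameterization trick; the layer names and the choice to predict log-variances (rather than standard deviations directly) are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """Flatten the encoder endpoint, predict 128 means and 128 log-variances
    (256 values in total), and draw one 128-dimensional sample."""
    def __init__(self, in_features, latent_dim=128):
        super().__init__()
        self.to_stats = nn.Linear(in_features, 2 * latent_dim)
        self.latent_dim = latent_dim

    def forward(self, x):
        stats = self.to_stats(x.flatten(start_dim=1))
        mu, logvar = stats.chunk(2, dim=1)
        # reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```

The sampled `z` is then mapped back to the spatial feature dimensions and decoded into a reconstructed image, while `mu` and `logvar` feed the KL penalty described in Sect. 2.3.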
2.2 The Second-Stage Network: Attention-Gated Asymmetrical U-Net with the VAE Branch
The input of the second-stage network (Fig. 2) is constructed from the segmentation maps produced by the first-stage network. To alleviate the label imbalance problem, we crop the output of the first-stage network to a spatial size of \(128\,{\times }\,128\,{\times }\,128\) voxels centered on the tumor area. The cropped segmentation maps are then concatenated with the original MRI images (cropped to the same area).
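One way to pick such a tumor-centered crop is to center the patch on the mean coordinate of the predicted tumor voxels and clip it to stay inside the volume; the exact cropping rule is not specified in the paper, so the following is only a plausible sketch:

```python
import numpy as np

def tumor_centered_crop_start(seg, crop=(128, 128, 128)):
    """Return start indices of a crop centered on non-background voxels of a
    predicted segmentation map, clipped so the patch fits inside the volume."""
    coords = np.argwhere(seg > 0)
    if len(coords):
        center = coords.mean(axis=0).astype(int)
    else:
        center = np.array(seg.shape) // 2  # no tumor predicted: use volume center
    start = [int(np.clip(c - s // 2, 0, dim - s))
             for c, s, dim in zip(center, crop, seg.shape)]
    return tuple(start)
```

For a BraTS-sized volume of \(240 \times 240 \times 155\) voxels, the clipping guarantees a valid \(128^3\) patch even when the tumor lies near the volume boundary.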
Encoder. The encoder part of the second-stage network has the same structure as in the first-stage network, except that the input has 7 channels (3 segmentation maps and 4 multimodal MRI images) and a spatial size of \(128\,{\times }\,128\,{\times }\,128\) voxels.
Decoder. Different from the first-stage network, we add the AGs of [15] to the decoder. The architecture of the AGs is described in the next subsection. At each spatial level, the gating signal from the coarser scale is passed into the attention gate to determine the attention coefficients. The output of an AG is the Hadamard product of the attention coefficients and the input features arriving from the encoder through the skip connection. The AG output at each spatial level is then combined with the two-times up-sampled features from the coarser scale by element-wise summation. The rest of the network architecture is the same as the decoder of the first-stage network.
Attention Gate. Instead of using a single identical scalar value to represent the attention level of each voxel vector, a gating vector \(g_i\) is computed to determine the focus region for each voxel i. Within the l-th spatial level, the AG can be formulated as
\(\alpha ^l_i = \sigma _2\big (W_{int}^T\, \sigma _1\big (W^T_x x^l_i + W^T_g g^{l+1}_i + b_g\big ) + b_{int}\big ), \qquad \hat{x}^l_i = x^l_i \cdot \alpha ^l_i,\)
where \(b_g\) and \(b_{int}\) are bias terms and the remaining symbols are explained below.
In each AG (Fig. 3), complementary information is extracted from the gating signal \(g^{l+1}_i\) from the coarser scale. To reduce the computational cost, linear transformations \(W^T_x\) and \(W^T_g\) (\(1\,{\times }\,1\,{\times }\,1\) convolutions) are applied to the input features \(x^l_i\) and gating signals \(g^{l+1}_i\) to halve the spatial size and the number of channels, respectively, so that the transformed input features and gating signals have the same spatial shape. Their element-wise sum is activated by the ReLU function \(\sigma _1\) and mapped by \(W_{int}^T\) into a lower-dimensional space for the gating operation, followed by the sigmoid function \(\sigma _2\) and a trilinear up-sampler that restores the attention coefficient matrix \(\alpha ^l_i\) to the resolution of the input features. The output \(\hat{x}^l_i\) of the AG is the element-wise product of the input features \(x^l_i\) and the attention coefficient matrix \(\alpha ^l_i\).
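The gate described above can be sketched as a small PyTorch module. Channel sizes and the stride-2 \(1\,{\times }\,1\,{\times }\,1\) convolution used for spatial downsizing are assumptions consistent with the description; this is not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Skip features x^l are spatially halved, the gating signal g^{l+1}
    is reduced in channels, their sum passes ReLU -> W_int -> sigmoid,
    and the coefficients are upsampled back to gate x^l."""
    def __init__(self, x_channels, g_channels, int_channels):
        super().__init__()
        self.W_x = nn.Conv3d(x_channels, int_channels, 1, stride=2)  # halve spatial size
        self.W_g = nn.Conv3d(g_channels, int_channels, 1)            # reduce channels
        self.W_int = nn.Conv3d(int_channels, 1, 1)                   # map to 1-channel gate

    def forward(self, x, g):
        q = F.relu(self.W_x(x) + self.W_g(g))          # sigma_1
        alpha = torch.sigmoid(self.W_int(q))           # sigma_2
        alpha = F.interpolate(alpha, size=x.shape[2:],
                              mode='trilinear', align_corners=False)
        return x * alpha                               # Hadamard product
```

With a skip tensor of shape \((1, 32, 16, 16, 16)\) and a coarser gating signal of shape \((1, 64, 8, 8, 8)\), the gated output retains the skip tensor's shape, as required for the subsequent element-wise summation with the upsampled decoder features.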
2.3 Loss Function
For both stages, the loss function consists of three parts:
\(L = L_{dice} + 0.1 \cdot L_{L2} + 0.1 \cdot L_{KL}.\)
\(L_{dice}\) is the soft Dice loss that encourages the decoder output \(p_{pred}\) to match the ground-truth segmentation mask \(p_{true}\):
\(L_{dice} = 1 - \frac{2\sum p_{pred}\, p_{true}}{\sum p_{pred}^2 + \sum p_{true}^2 + \epsilon},\)
where \(\epsilon\) is a small constant to avoid division by zero. \(L_{L2}\) is the L2 loss applied to the VAE branch output \(I_{pred}\) to match the input image \(I_{input}\):
\(L_{L2} = \lVert I_{input} - I_{pred}\rVert _2^2.\)
\(L_{KL}\) is the KL divergence, used as a VAE penalty term to drive the estimated Gaussian distribution \(N(\mu , \sigma ^2)\) toward the standard Gaussian \(N(0, 1)\):
\(L_{KL} = \frac{1}{N}\sum \left (\mu ^2 + \sigma ^2 - \log \sigma ^2 - 1\right ),\)
where N is the number of image voxels. As suggested in [14], we set the hyper-parameter weights to 0.1 to reach a good balance between the Dice and VAE loss terms.
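The Dice and KL terms above can be expressed compactly; this NumPy sketch operates on flattened probability maps and the `(mu, logvar)` statistics of the VAE bottleneck, and is illustrative rather than the authors' code:

```python
import numpy as np

def soft_dice_loss(p_pred, p_true, eps=1e-7):
    """1 - soft Dice coefficient over flattened probability maps."""
    inter = np.sum(p_pred * p_true)
    return 1.0 - 2.0 * inter / (np.sum(p_pred ** 2) + np.sum(p_true ** 2) + eps)

def kl_loss(mu, logvar):
    """Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
    averaged over the N voxels; logvar = log(sigma^2)."""
    return np.mean(mu ** 2 + np.exp(logvar) - logvar - 1.0)
```

A perfect prediction drives `soft_dice_loss` to (approximately) zero, and `kl_loss` vanishes exactly when the estimated distribution is standard normal (`mu = 0`, `logvar = 0`).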
3 Experiment
3.1 Data Description
The BraTS 2020 training dataset includes 259 HGG cases and 110 LGG cases. All image modalities (T1, T1c, T2, and T2-FLAIR) are co-registered, with an image size of \(240 \,{\times }\, 240\, {\times }\, 155\) voxels and 1 mm isotropic resolution. The training data are provided with annotations, while the validation dataset (125 cases) and testing dataset (166 cases) are provided without annotations. Participants evaluate their methods by uploading predicted segmentation volumes to the organizer’s server. Multiple submissions are permitted for the validation evaluation, whereas only one submission is allowed for the final testing evaluation.
3.2 Implementation Details
Our network is implemented in PyTorch and trained on four NVIDIA P40 GPUs.
Optimization. We use the Adam optimizer with an initial learning rate of \(lr_0=10^{-4}\) for weight updating. We progressively decay the learning rate according to the following formula:
\(lr = lr_0 \times \left (1 - \frac{e}{N_e}\right )^{0.9},\)
where e is an epoch counter, and \(N_e\) is the total number of training epochs. In our case, \(N_e\) is set to 300.
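A sketch of this polynomial decay schedule, with the exponent 0.9 taken from the schedule of [14] (the original displayed formula did not survive extraction, so the exponent is an assumption):

```python
def poly_lr(epoch, lr0=1e-4, total_epochs=300, power=0.9):
    """Polynomial learning-rate decay: lr = lr0 * (1 - e/N_e)^power.
    Starts at lr0 at epoch 0 and decays to 0 at the final epoch."""
    return lr0 * (1.0 - epoch / total_epochs) ** power
```

In practice the returned value would be assigned to each Adam parameter group at the start of every epoch.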
Data Preprocessing. Before feeding data into the first-stage network, we apply intensity normalization to each MRI modality of each patient: each modality is normalized by subtracting the mean and dividing by the standard deviation of its non-zero region. In the second stage, we crop the segmentation maps from the first-stage network into \(128\,{\times }\,128\,{\times }\,128\)-voxel patches for each patient, ensuring that each patch includes most tumor voxels. The patches are concatenated with the normalized MRI images (after data augmentation, cropped at the same position) and fed into the second-stage network for training.
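A minimal sketch of the per-modality normalization; that background voxels are left at zero (rather than also shifted) is an assumption, since the text only specifies where the statistics come from:

```python
import numpy as np

def normalize_modality(volume):
    """Z-score normalization using the mean and std of the non-zero
    (brain) region only; background voxels remain zero."""
    mask = volume != 0
    mean, std = volume[mask].mean(), volume[mask].std()
    out = np.zeros_like(volume, dtype=np.float64)
    out[mask] = (volume[mask] - mean) / (std + 1e-8)
    return out
```

This is applied independently to each of the four modalities of each patient, so intensity scales differ per scan rather than per dataset.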
Data Augmentation. To reduce the risk of overfitting, three data augmentation strategies are used. First, the training data is randomly cropped to \(160\,{\times }\,192\,{\times }\,128\) voxels before being fed into the first-stage network. In addition, in both stages, we randomly shift the intensity of the input data by a value in \([-0.1, 0.1]\) of the standard deviation of each channel, and randomly scale the intensity by a factor in [0.9, 1.1]. Finally, in both stages we apply random flipping along each 3D axis with a probability of 50%.
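The intensity and flipping augmentations can be sketched as below; the order in which shift and scale are applied is an assumption, as the text does not fix it:

```python
import numpy as np

def augment(x, rng):
    """Per-channel intensity shift in [-0.1, 0.1] * channel std and scale in
    [0.9, 1.1], then random flips along the three spatial axes.
    `x` has shape (channels, D, H, W); `rng` is a numpy Generator."""
    for c in range(x.shape[0]):
        shift = rng.uniform(-0.1, 0.1) * x[c].std()
        scale = rng.uniform(0.9, 1.1)
        x[c] = (x[c] + shift) * scale
    for axis in (1, 2, 3):          # the three spatial axes
        if rng.random() < 0.5:      # flip each axis with probability 50%
            x = np.flip(x, axis=axis)
    return x
```

Random cropping is handled separately, since the first stage crops to \(160\,{\times }\,192\,{\times }\,128\) while the second stage works on \(128^3\) patches.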
Expanded Training Data. Since the training processes of the two stages are independent, we can select several first-stage trained models with competitive performance and use their segmentation results as training data for the second-stage network. This strategy trades a longer training process for better model performance and more stable results. Specifically, we select 6 individual first-stage models (from different epochs with different train-validation divisions) and combine their segmentation results into an expanded dataset for training the second-stage network (Fig. 4). Note that the train-validation division is based on patient IDs; the 6 segmentation results belonging to the same patient are therefore grouped into the same set. We also tried training the second-stage network using a single model’s segmentation results, but obtained only slight improvement over the first-stage network.
Post-processing. We observed that when the predicted volume of ET is particularly small, the algorithm tends to falsely predict TC voxels as ET. Based on our experience, in post-processing we replace ET with TC when the predicted ET volume is less than 500 voxels.
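One way to realize "replace ET with TC" on a BraTS label map is to relabel ET voxels (label 4) as NCR/NET (label 1), which keeps them inside TC; the choice of label 1 as the replacement is an assumption:

```python
import numpy as np

def postprocess_et(seg, threshold=500):
    """If fewer than `threshold` voxels are predicted as ET (label 4),
    relabel them as NCR/NET (label 1), i.e. fold ET back into TC."""
    if np.sum(seg == 4) < threshold:
        seg = np.where(seg == 4, 1, seg)
    return seg
```

This removes spurious tiny ET predictions without changing the WT or TC extents.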
Ensemble. We use majority voting for model ensembling. In particular, if a voxel receives equal votes for multiple categories, its final category is determined by the average predicted probability of each category.
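A sketch of this voting rule: the small probability-weighted term only matters when the integer vote counts tie, so the average class probability acts purely as a tiebreaker. The array layout is an assumption:

```python
import numpy as np

def ensemble(labels, probs):
    """Majority voting across M models, with ties broken by the highest
    mean class probability.
    `labels`: (M, ...) integer class predictions per model.
    `probs`:  (M, C, ...) per-class probability maps per model."""
    num_classes = probs.shape[1]
    votes = np.stack([np.sum(labels == c, axis=0)
                      for c in range(num_classes)])      # (C, ...)
    mean_prob = probs.mean(axis=0)                        # (C, ...), in [0, 1]
    # 1e-3 * mean_prob < 1 vote, so it only decides exact ties
    return np.argmax(votes + 1e-3 * mean_prob, axis=0)
```

With three models voting 0, 1, 1 for a voxel, class 1 wins outright; with a 1–1 tie between two models, the class with the higher average probability is chosen.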
4 Results
4.1 Quantitative Results
The BraTS 2020 validation dataset includes 125 cases, provided without tumor subtypes (HGG/LGG) or tumor subregion annotations. Table 1 reports the per-class Dice scores and Hausdorff distances on the validation dataset, evaluated by the official platform (https://ipp.cbica.upenn.edu/).
Comparing the segmentation performance of the 190th-epoch models of the two stages, the accuracy improvement brought by the second-stage network is more evident for TC than for WT, and training the second-stage network with expanded training data further improves the Dice score for TC.
As a performance-boosting component, the second-stage network trained with expanded data can be appended to any first-stage model to enhance segmentation performance. It also reduces the performance variation across models of different epochs: Table 2 shows that the standard deviations (SD) of the TC Dice score and Hausdorff distance are reduced by 68% and 93%, respectively, in the second stage. The SDs are calculated over all trained non-ensembled models. We also observe that the second-stage network remarkably reduces the variation of the ET Dice score and Hausdorff distance, but this improvement no longer exists after post-processing.
The BraTS 2020 testing dataset contains 166 cases without providing tumor annotations. Our segmentation results on this dataset are presented in Table 3.
4.2 Attention Map
The attention matrices at the finest scale are visualized as heatmaps, with red indicating higher weights and blue indicating lower weights (Fig. 5). In the first few training epochs, the AGs already locate the tumor but also assign high weights to gray matter. As training progresses, the weights assigned to non-tumor regions gradually decrease. The AGs also help the model avoid misclassifying voxels around the tumor boundary by gradually decreasing the weights assigned to those voxels.
5 Concluding Remarks
This paper proposes a two-stage cascade network with VAEs and AGs for 3D MRI brain tumor segmentation. The results indicate that the second-stage network improves and stabilizes the predictions for all three tumor subregions, particularly for TC and ET (before post-processing). The second-stage network can also produce more qualified model candidates for further model ensembling. In this study, we use the segmentation results of multiple first-stage models to train the second-stage network. Though this improves prediction performance, it noticeably increases the training time as a trade-off; consequently, the technique may not suit settings with limited computing resources or research time. In addition, Table 1 shows that even though the expanded training data does not include the output of the first-stage 190th-epoch model, the trained second-stage models still improve on that model's first-stage prediction. This indicates that a second-stage network trained with this strategy generalizes across models of different epochs.
Since it was first proposed in natural language processing [17], the attention mechanism has been extensively studied and widely used in image segmentation. Technically speaking, attention in image segmentation can be divided into spatial attention, such as the AGs used in our method, and channel attention, e.g., the “squeeze-and-excitation” block in [9, 22]. Fu et al. [6] proposed combining the two kinds of attention in 2D problems, but the multiplications between huge matrices involved in that method would likely exceed computational limits in 3D scenarios. Further research is expected to combine the two attention mechanisms appropriately for brain tumor segmentation to enhance accuracy. Besides, Dai et al. [5] utilized extreme gradient boosting (XGBoost) for model ensembling and gained additional accuracy compared with the majority-voting and probability-averaging approaches. It may be worth integrating XGBoost into our method, as the second stage provides more candidate models to choose from for XGBoost training. Moreover, Zhong et al. [21] have recently developed a segmentation network that incorporates the dilated convolution [19] and the dense block [10]; these two popular deep-learning techniques may also be worth incorporating into our network structure.
References
Bakas, S., et al.: Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection. The Cancer Imaging Archive (2017)
Bakas, S., et al.: Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection. The Cancer Imaging Archive (2017)
Bakas, S., et al.: Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 170117 (2017)
Bakas, S., et al.: Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv preprint arXiv:1811.02629 (2018)
Dai, L., Li, T., Shu, H., Zhong, L., Shen, H., Zhu, H.: Automatic brain tumor segmentation with domain adaptation. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018. LNCS, vol. 11384, pp. 380–392. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11726-9_34
Fu, J., et al.: Dual attention network for scene segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154 (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
Isensee, F., Kickingereder, P., Wick, W., Bendszus, M., Maier-Hein, K.H.: No new-net. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018. LNCS, vol. 11384, pp. 234–244. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11726-9_21
Jiang, Z., Ding, C., Liu, M., Tao, D.: Two-stage cascaded U-net: 1st place solution to BraTS challenge 2019 segmentation task. In: Crimi, A., Bakas, S. (eds.) BrainLes 2019. LNCS, vol. 11992, pp. 231–241. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46640-4_22
Menze, B.H., et al.: The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34(10), 1993–2024 (2015)
Myronenko, A.: 3D MRI brain tumor segmentation using autoencoder regularization. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018. LNCS, vol. 11384, pp. 311–320. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11726-9_28
Oktay, O., et al.: Attention U-net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018)
Tu, Z., Bai, X.: Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 32(10), 1744–1757 (2009)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Wang, G., Li, W., Ourselin, S., Vercauteren, T.: Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks. In: Crimi, A., Bakas, S., Kuijf, H., Menze, B., Reyes, M. (eds.) BrainLes 2017. LNCS, vol. 10670, pp. 178–190. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75238-9_16
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
Zhao, Y.-X., Zhang, Y.-M., Liu, C.-L.: Bag of tricks for 3D MRI brain tumor segmentation. In: Crimi, A., Bakas, S. (eds.) BrainLes 2019. LNCS, vol. 11992, pp. 210–220. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46640-4_20
Zhong, L., et al.: (TS)2WM: tumor segmentation and tract statistics for assessing white matter integrity with applications to glioblastoma patients. Neuroimage 223, 117368 (2020)
Zhou, C., Chen, S., Ding, C., Tao, D.: Learning contextual and attentive information for brain tumor segmentation. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum, T. (eds.) BrainLes 2018. LNCS, vol. 11384, pp. 497–507. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11726-9_44
Zhou, C., Ding, C., Lu, Z., Wang, X., Tao, D.: One-pass multi-task convolutional neural networks for efficient brain tumor segmentation. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11072, pp. 637–645. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00931-1_73
Acknowledgements
This research was partially supported by the grant R21AG070303 from the National Institutes of Health and a startup fund from New York University. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or New York University.
© 2021 Springer Nature Switzerland AG
Lyu, C., Shu, H. (2021). A Two-Stage Cascade Model with Variational Autoencoders and Attention Gates for MRI Brain Tumor Segmentation. In: Crimi, A., Bakas, S. (eds) Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2020. Lecture Notes in Computer Science(), vol 12658. Springer, Cham. https://doi.org/10.1007/978-3-030-72084-1_39