
1 Introduction

Gliomas are the most common malignant brain tumors. Broadly, gliomas are categorized into aggressive high-grade and slow-growing low-grade types. In both types, the tissue changes caused by tumor cells can be captured using multi-modality Magnetic Resonance Imaging (MRI). The commonly used modalities are T1, T2, contrast-enhanced T1 (ceT1), and FLAIR; they are the default choice for radiologists to identify the tumor type and its progression stage. Towards this objective, accurate and automatic brain tumor segmentation based on multi-parametric MRI is an active field of research [23] and could support diagnosis, surgery planning [11, 12], follow-up, and radiation therapy [1, 2]. The BraTS 2021 challenge offers machine learning researchers a unique and unprecedented opportunity to develop a clinically deployable solution for multi-class glioma segmentation.

Fig. 1. Illustration of the 3D U-Net [10] architecture used. Blue boxes represent feature maps. IN stands for instance normalization [32]. The design of this 3D U-Net was determined using the heuristics of nnU-Net and our previous work [13, 16, 17, 21]. (Color figure online)

Aiming for computational efficiency, we use the 3D U-Net and its recent transformer variant, TransUNet [9], as the primary models, and focus on finding better learning schemes, such as augmentation, loss function, and optimizer, as well as an efficient inference routine for ensemble models. It has recently been shown that the combination of loss functions may have a crucial impact on the resulting segmentation [24]. In our setting, we use the generalized Wasserstein Dice loss [15], which has shown superior segmentation performance compared to the mean Dice loss [26, 28, 30] in the BraTS 2020 challenge [16] and for other medical image segmentation tasks [8, 31]. We investigate the effect of different state-of-the-art optimizers, namely SGD, SGDP [20], and ASAM [25]. Lastly, we use an efficient test-time ensemble approach for the final segmentation result.

2 Methods and Materials

2.1 Data

We used the BraTS 2021 datasetFootnote 1 [3] in our experiments. No additional data were used. The dataset contains the same four MRI sequences (T1, ceT1, T2, and FLAIR) for all cases, corresponding to patients with either a high-grade glioma [5] or a low-grade glioma [6]. All cases were manually segmented for peritumoral edema, enhancing tumor, and non-enhancing tumor core using the same labeling protocol [3, 4, 7, 27]. The training dataset contains 1251 cases, and the validation dataset contains 219 cases. The MRI scans of both datasets are publicly available, but manual segmentations are available only for the training dataset. Evaluation on the validation dataset was performed using the BraTS 2021 challenge online evaluation platformFootnote 2. For each case, the four MRI sequences are available after co-registration to the same anatomical template, interpolation to a 1 mm isotropic resolution, and skull stripping [27].

2.2 Deep Learning Pipeline

We used the DynUNet class of MONAI [29] to implement a baseline 3D U-Net with one input block, 4 down-sampling blocks, one bottleneck block, 5 up-sampling blocks, 32 features in the first level, instance normalization [32], and leaky-ReLU with slope 0.01. An illustration of the architecture is provided in Fig. 1. We used the same pipeline for our participation in the FeTA challenge 2021 [14].
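For concreteness, the snippet below is a minimal sketch of how such a network can be instantiated with MONAI's DynUNet. The feature counts beyond the first level and the deep-supervision settings are assumptions following nnU-Net conventions, not values stated above.

```python
from monai.networks.nets import DynUNet

# Minimal sketch of the baseline 3D U-Net described above.
model = DynUNet(
    spatial_dims=3,
    in_channels=4,                         # T1, ceT1, T2, FLAIR
    out_channels=4,                        # background + 3 tumor classes
    kernel_size=[3, 3, 3, 3, 3, 3],
    strides=[1, 2, 2, 2, 2, 2],            # input block, 4 down-samplings, bottleneck
    upsample_kernel_size=[2, 2, 2, 2, 2],  # 5 up-sampling blocks
    filters=[32, 64, 128, 256, 320, 320],  # 32 features at the first level (rest assumed)
    norm_name="instance",
    act_name=("leakyrelu", {"negative_slope": 0.01}),
    deep_supervision=True,
    deep_supr_num=3,                       # 4 supervised outputs in total (assumed)
)
```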

Transformers have recently received attention in medical image computing for their multi-head self-attention mechanism. As a second network architecture, we replace the bottleneck block of the U-Net with a vision transformer, as proposed by [9], using the same transformer architecture as in [9]. A transformer in the bottleneck allows the network to accumulate the global context of the image and to learn an anatomically consistent representation of the tumor classes.

Table 1. Network architecture specification

Table 1 compares the 3D U-Net and TransUNet in terms of the number of parameters and inference time. For both networks, we train using a patch size of \(128 \times 192 \times 128\).

2.3 Loss Function

We have experimented with two loss functions: the sum of the cross-entropy loss and the mean-class Dice loss

$$\begin{aligned} \mathcal {L}_{DL+CE} = \mathcal {L}_{DL} + \mathcal {L}_{CE} \end{aligned}$$
(1)

and the sum of the cross-entropy loss and the generalized Wasserstein Dice lossFootnote 3 [15, 16].

$$\begin{aligned} \mathcal {L}_{GWDL+CE} = \mathcal {L}_{GWDL} + \mathcal {L}_{CE} \end{aligned}$$
(2)

where \(\mathcal {L}_{CE}\) is the cross-entropy loss function

$$\begin{aligned} \mathcal {L}_{CE}(\hat{{\textbf {p}}}, {\textbf {p}}) = - \sum _{i=1}^N \sum _{l=1}^{L} p_{i,l} \log (\hat{p}_{i,l}) \end{aligned}$$
(3)

with N the number of voxels, L the number of classes, i the index for voxels, l the index for classes, \(\hat{{\textbf {p}}}=\left( \hat{p}_{i,l}\right) _{i,l}\) the predicted probability map, and \({\textbf {p}}=\left( p_{i,l}\right) _{i,l}\) the discrete ground-truth probability map.

\(\mathcal {L}_{DL}\) is the mean-class Dice loss [26, 30]

$$\begin{aligned} \mathcal {L}_{DL}(\hat{{\textbf {p}}}, {\textbf {p}}) = 1 - \frac{1}{L}\sum _{l=1}^{L} \frac{2 \sum _{i=1}^N p_{i,l} \hat{p}_{i,l}}{\sum _{i=1}^N p_{i,l} + \sum _{i=1}^N \hat{p}_{i,l}} \end{aligned}$$
(4)

and \(\mathcal {L}_{GWDL}\) is the generalized Wasserstein Dice loss [15]

$$\begin{aligned} \left\{ \begin{aligned} \mathcal {L}_{GWDL}(\hat{{\textbf {p}}}, {\textbf {p}})&= 1 - \frac{2\sum _{l \ne b} \sum _{i} p_{i,l}(1 - W^M(\hat{{\textbf {p}}}_i, {\textbf {p}}_{i}))}{2\sum _{l \ne b}\left[ \sum _{i} p_{i,l}(1 - W^M(\hat{{\textbf {p}}}_i, {\textbf {p}}_{i})) \right] + \sum _{i} W^M(\hat{{\textbf {p}}}_i, {\textbf {p}}_{i})}\\ \forall i,\quad W^M\left( \hat{{\textbf {p}}}_i, {\textbf {p}}_i\right)&= \sum _{l=1}^L p_{i,l}\left( \sum _{l'=1}^L M_{l,l'}\hat{p}_{i,l'}\right) \end{aligned} \right. \end{aligned}$$
(5)

where \(W^M\left( \hat{{\textbf {p}}}_i, {\textbf {p}}_i\right) \) is the Wasserstein distance between the predicted \(\hat{{\textbf {p}}}_i\) and the ground-truth \({\textbf {p}}_i\) discrete probability distributions at voxel i, \(M = \left( M_{l,l'}\right) _{1 \le l,\,l' \le L}\) is a distance matrix between the BraTS 2021 labels, and b is the class index corresponding to the background. For the class indices 0: background, 1: enhancing tumor, 2: edema, and 3: non-enhancing tumor, we set

$$\begin{aligned} M = \left( \begin{array}{cccc} 0 & 1 & 1 & 1 \\ 1 & 0 & 0.7 & 0.5 \\ 1 & 0.7 & 0 & 0.6 \\ 1 & 0.5 & 0.6 & 0 \\ \end{array} \right) \end{aligned}$$
(6)

The generalized Wasserstein Dice loss [15] is a generalization of the Dice loss for multi-class segmentation that can take advantage of the hierarchical structure of the set of classes in BraTS. When the labeling of a voxel is ambiguous or too difficult for the neural network to predict correctly, the generalized Wasserstein Dice loss and our matrix M are designed to favor mistakes that remain consistent with the sub-regions used in the BraTS evaluation, i.e., the tumor core and the whole tumor.
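To make Eq. (5) concrete, the sketch below computes the per-voxel Wasserstein distance and the resulting loss in PyTorch. This is a simplified illustration with our own function names, not the authors' reference implementation (Footnote 3).

```python
import torch

# Distance matrix M of Eq. (6): 0 background, 1 enhancing tumor,
# 2 edema, 3 non-enhancing tumor.
M = torch.tensor([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.7, 0.5],
    [1.0, 0.7, 0.0, 0.6],
    [1.0, 0.5, 0.6, 0.0],
])

def generalized_wasserstein_dice_loss(probs, target_onehot, M, background=0):
    """probs, target_onehot: (N_voxels, L) predicted / one-hot ground-truth maps."""
    # Per-voxel Wasserstein distance of Eq. (5):
    # W^M_i = sum_l p_{i,l} * sum_l' M_{l,l'} * phat_{i,l'}
    wm = (target_onehot * (probs @ M.T)).sum(dim=1)  # shape (N_voxels,)
    fg = [l for l in range(M.shape[0]) if l != background]
    # Generalized true-positive term of Eq. (5), summed over classes l != b.
    tp = (target_onehot[:, fg] * (1.0 - wm).unsqueeze(1)).sum()
    return 1.0 - (2.0 * tp) / (2.0 * tp + wm.sum())
```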

2.4 Optimization

Common Optimization Setting: For each network, the training dataset was split into \(95\%\) training and \(5\%\) validation at random. The weights of all deep neural network architectures were randomly initialized using He initialization [19]. We used a batch size of 2. The network parameters used at inference correspond to the last epoch. We used deep supervision with 4 levels during training. Training each 3D U-Net required 16 GB of GPU memory.

SGD: We used SGD with Nesterov momentum. The initial learning rate was 0.02, and we used polynomial learning rate decay with power 0.9 for a total of 500 epochs.
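A minimal sketch of this schedule (reusing the `model` from the MONAI snippet above; the momentum value is an assumption, as the text does not state it):

```python
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

optimizer = SGD(model.parameters(), lr=0.02, momentum=0.99, nesterov=True)
# Polynomial decay with power 0.9: lr(epoch) = 0.02 * (1 - epoch / 500) ** 0.9.
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: (1.0 - epoch / 500) ** 0.9)
# Call scheduler.step() once per epoch.
```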

ADAM [22]: For Adam, we used a linear warm-up of the learning rate from 0 to 0.003 over the first 1000 iterations, followed by a constant learning rate of 0.003 for 500 epochs.
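A sketch of this warm-up, stepped once per training iteration rather than per epoch:

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

optimizer = Adam(model.parameters(), lr=0.003)
# Linear warm-up from 0 to 0.003 over the first 1000 iterations, then constant.
scheduler = LambdaLR(optimizer, lr_lambda=lambda it: min(1.0, it / 1000))
# Call scheduler.step() once per iteration.
```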

Adaptive Sharpness-Aware Minimization (ASAM) [18, 25]: We used SGD as the base optimizer with the initial learning rate set to 0.02 and polynomial learning rate decay with power 0.9 for a total of 500 epochs. We used the default hyperparameters of ASAM [25], \(\rho =0.5\) and \(\eta =0.1\). We used the authors' PyTorch implementationFootnote 4.
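The two-step ASAM update can be sketched as below. The class and method names follow our reading of the authors' repository (Footnote 4); treat the import path and signatures as assumptions.

```python
from asam import ASAM  # assumed import path for the authors' implementation

minimizer = ASAM(optimizer, model, rho=0.5, eta=0.1)
for images, labels in train_loader:
    # Ascent step: perturb the weights toward higher loss.
    criterion(model(images), labels).backward()
    minimizer.ascent_step()
    # Descent step: update the original weights using gradients at the perturbed point.
    criterion(model(images), labels).backward()
    minimizer.descent_step()
```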

SGDP [20]: For SGD Projected (SGDP), we used exactly the same hyperparameter values as for SGD. We used the authors' PyTorch implementationFootnote 5.

2.5 Data Augmentation

We have used random zoom (zoom ratio range [0.7, 1.5] drawn uniformly at random; probability of augmentation 0.3), random rotation (rotation angle range \([-15^{\circ }, 15^{\circ }]\) for all dimensions drawn uniformly at random; probability of augmentation 0.3), random additive Gaussian noise (mean 0, standard deviation 0.1; probability of augmentation 0.3), random Gaussian spatial smoothing (standard deviation range [0.5, 1.5] in voxels for all dimensions drawn uniformly at random; probability of augmentation 0.2), random gamma augmentation (gamma range [0.7, 1.5] drawn uniformly at random; probability of augmentation 0.3), and random right/left flip (probability of augmentation 0.5).
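These augmentations map directly onto MONAI dictionary transforms; the sketch below is our reading of the list above, with the image/label interpolation modes and the flip axis as assumptions.

```python
import numpy as np
from monai.transforms import (
    Compose, RandAdjustContrastd, RandFlipd, RandGaussianNoised,
    RandGaussianSmoothd, RandRotated, RandZoomd,
)

keys = ["image", "label"]
train_aug = Compose([
    RandZoomd(keys, prob=0.3, min_zoom=0.7, max_zoom=1.5,
              mode=("trilinear", "nearest")),
    RandRotated(keys, prob=0.3, range_x=np.pi / 12, range_y=np.pi / 12,
                range_z=np.pi / 12, mode=("bilinear", "nearest")),
    RandGaussianNoised("image", prob=0.3, mean=0.0, std=0.1),
    RandGaussianSmoothd("image", prob=0.2, sigma_x=(0.5, 1.5),
                        sigma_y=(0.5, 1.5), sigma_z=(0.5, 1.5)),
    RandAdjustContrastd("image", prob=0.3, gamma=(0.7, 1.5)),  # gamma augmentation
    RandFlipd(keys, prob=0.5, spatial_axis=0),  # right/left flip (axis assumed)
])
```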

2.6 Inference

Single Models Inference: For the models evaluated and compared in Table 2, a patch-based approach is used. The input image is divided into overlapping patches of size \(128 \times 192 \times 128\), chosen so that neighboring patches overlap by at least half of their volume. The patch predictions are fused by weighted averaging before the softmax operation. The weights are defined with respect to the distance of a voxel to the center of the patch, using a Gaussian kernel with a standard deviation, for each dimension, equal to \(0.125 \times \text {patch-dimension}\). In addition, test-time augmentation [33] is used with right-left flip. The two softmax predictions obtained with and without right-left flip are merged by averaging.
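This procedure corresponds closely to MONAI's sliding_window_inference with Gaussian weighting (whose default sigma_scale is 0.125). A sketch combining it with the right-left flip test-time augmentation, with the flip axis as an assumption:

```python
import torch
from monai.inferers import sliding_window_inference

def predict_single_model(model, image, roi_size=(128, 192, 128)):
    """image: (1, 4, H, W, D). Returns flip-averaged softmax probabilities."""
    def infer(x):
        # Gaussian-weighted fusion of overlapping patch logits.
        return sliding_window_inference(
            x, roi_size=roi_size, sw_batch_size=1, predictor=model,
            overlap=0.5, mode="gaussian", sigma_scale=0.125)

    flip_dim = 2  # assumed right/left spatial axis
    probs = torch.softmax(infer(image), dim=1)
    probs_flip = torch.softmax(infer(torch.flip(image, dims=[flip_dim])), dim=1)
    return 0.5 * (probs + torch.flip(probs_flip, dims=[flip_dim]))
```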

Ensemble Inference: For the ensembles, the inference is performed in two steps. In the first step, an initial segmentation is computed using a single model and the single-model inference procedure described above. In practice, we used the first model of the list and did not tune this choice.

The first segmentation is used to estimate the center of gravity of the whole tumor. In the second step, we crop a patch of size \(128 \times 192 \times 128\) with a center chosen as close as possible to the center of gravity of the tumor so that the patch fits in the image. The segmentation probability predictions of all the models of the ensemble are computed for this patch. The motivation for this two-step approach is to reduce the inference time as compared to using the patch-based approach described above for all the models of the ensemble. This strategy is based on the assumption that a patch of size \(128 \times 192 \times 128\) is large enough to always contain the whole tumor. During the second step, test-time augmentations with right-left flip and zoom with a ratio of 1.125 are used. The four segmentation probability predictions obtained for the different augmentations (no flip - no zoom, flip - no zoom, no flip - zoom, and flip - zoom) are combined by averaging the softmax predictions. For the full image segmentation prediction, the voxels outside the patch centered on the tumor are set to the background.
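A hypothetical helper illustrating the cropping in the second step (the names are ours, not the authors' code): crop a patch centered as close as possible to the whole-tumor center of gravity while keeping the patch inside the volume.

```python
import torch

def crop_around_tumor(image, first_seg, patch_size=(128, 192, 128)):
    """image: (C, H, W, D); first_seg: (H, W, D) label map from the first step."""
    coords = (first_seg > 0).nonzero(as_tuple=False).float()  # whole-tumor voxels
    center = coords.mean(dim=0).round().long()                # center of gravity
    starts = []
    for c, p, dim in zip(center.tolist(), patch_size, first_seg.shape):
        starts.append(int(min(max(c - p // 2, 0), dim - p)))  # shift patch to fit inside
    sx, sy, sz = starts
    px, py, pz = patch_size
    patch = image[:, sx:sx + px, sy:sy + py, sz:sz + pz]
    return patch, (sx, sy, sz)  # offsets for pasting predictions back
```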

3 Results

As primary metrics, we report the mean and the standard deviation of the Dice score and of the Hausdorff 95% distance for each class. Percentiles are common statistics for measuring the robustness of automatic segmentations [13]. To evaluate the robustness of the different models, we report the \(25\%\) and \(5\%\) percentiles of the Dice score and the \(75\%\) and \(95\%\) percentiles of the Hausdorff 95% distance. In Table 2, we report the validation scores of our individually trained models. In Table 3, we compare the two ensemble strategies described in the previous section. We do not include the TransUNet model in the ensembles, as its individual performance is slightly worse than that of the 3D U-Net.
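For reference, these robustness statistics can be computed directly from the per-case metrics; dice_scores and hd95_values below are hypothetical placeholder arrays, not results from the paper.

```python
import numpy as np

# Hypothetical per-case metrics (placeholders for illustration only).
dice_scores = np.array([0.92, 0.88, 0.95, 0.61, 0.90])
hd95_values = np.array([2.0, 3.5, 1.4, 12.0, 2.2])

dice_p25, dice_p5 = np.percentile(dice_scores, [25, 5])    # lower tail of the Dice score
hd95_p75, hd95_p95 = np.percentile(hd95_values, [75, 95])  # upper tail of the HD95
```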

Table 2. Segmentation results on the BraTS 2021 Validation dataset. The evaluation was performed on the BraTS online evaluation platform. ET: Enhancing Tumor, WT: Whole Tumor, TC: Tumor Core, Std: Standard deviation, \(p_x\): Percentile x. The split number corresponds to the random seed that was used to split the training dataset into \(95\%\) training/\(5\%\) validation at random.
Table 3. Segmentation results on the BraTS 2021 Validation dataset for ensembling and test-time augmentation. The evaluation was performed on the BraTS online evaluation platform. ET: Enhancing Tumor, WT: Whole Tumor, TC: Tumor Core, Std: Standard deviation, \(p_x\): Percentile x. Best values are in bold.
Table 4. Segmentation results on the BraTS 2021 Testing dataset using ensembling and test-time augmentation. The evaluation was performed by the BraTS 2021 challenge organizers using our docker submission. ET: Enhancing Tumor, WT: Whole Tumor, TC: Tumor Core, Std: Standard deviation, \(p_x\): Percentile x.

4 Discussion

From Table 2, we see that the 3D U-Net trained with the generalized Wasserstein Dice loss performs consistently better than the one trained with the Dice loss (baseline model). TransUNet does not offer any improvement over the baseline; rather, its performance deteriorates slightly. We hypothesize that over-parameterization can be an issue in this case. The optimizers SGDP and ASAM perform similarly to the baseline SGD. From Table 3, we see that the ensemble strategy helps increase the robustness of the model. The best ensemble strategy turns out to be the one including zoom as a test-time augmentation. This approach was submitted for evaluation on the BraTS 2021 testing dataset, and the results can be found in Table 4. In conclusion, this paper proposes a detailed comparative study of strategies for building a computationally efficient yet robust automatic brain tumor segmentation model. We have explored ensembles of multiple training configurations with different state-of-the-art loss functions and optimizers, and, importantly, test-time augmentation. Future research will focus on further strategies for test-time augmentation and test-time hyper-parameter tuning.