
1 Introduction

Gliomas are a group of malignancies that arise from the glial cells in the brain and are currently the most common primary tumors of the central nervous system [1, 2]. The symptoms of patients presenting with a glioma depend on its anatomical site in the brain and can be too non-specific (e.g. headaches, nausea or vomiting, mood and personality alterations) to allow an accurate diagnosis in the early stages of the disease. The primary diagnosis is usually confirmed by magnetic resonance imaging (MRI) or computed tomography (CT), which provide additional structural information about the tumor.

Gliomas usually consist of heterogeneous sub-regions (edema, enhancing and non-enhancing tumor core, etc.) with variable histologic and genomic phenotypes [1]. Presently, multimodal MRI scans are used for non-invasive tumor evaluation and treatment planning, due to their ability to depict the tumor sub-regions with different intensities. However, segmentation of brain tumors in multimodal MRI scans is one of the most challenging tasks in medical imaging because of the high heterogeneity of tumor appearances and shapes.

The brain tumor segmentation challenge (BraTS) [3,4,5,6] is aimed at the development of automatic methods for brain tumor segmentation. All participants of the BraTS are provided with a clinically-acquired training dataset of pre-operative MRI scans (4 sequences per patient) and segmentation masks for three tumor sub-regions, namely the GD-enhancing tumor, the peritumoral edema, and the necrotic and non-enhancing tumor core. The MRI scans were acquired with different clinical protocols and various scanners at 19 institutions. Each scan was annotated manually by one to four raters, and the annotations were subsequently approved by expert raters.

The performance of the proposed algorithms was evaluated by the Dice score, sensitivity, specificity and the 95th percentile of the Hausdorff distance.

2 Materials and Methods

2.1 SE Normalization

Normalization layers have become an integral part of modern deep neural networks. Existing methods, such as Batch Normalization [7], Instance Normalization [8], Layer Normalization [9], etc., have been shown to be effective for training different types of deep learning models. In essence, any normalization layer performs the following computations. First, for an n-dimensional input \(X = (x^{(1)}, x^{(2)}, \dots , x^{(n)})\), we normalize each dimension

$$\begin{aligned} x'^{(i)} = \frac{1}{\sigma ^{(i)}}(x^{(i)} - \mu ^{(i)}) \end{aligned}$$
(1)

where \(\mu ^{(i)} = \text {E}[x^{(i)}]\) and \(\sigma ^{(i)} = \sqrt{\text {Var}[x^{(i)}] + \epsilon }\) with \(\epsilon \) as a small constant. Normalization layers mainly differ in terms of the dimensions chosen to compute the mean and standard deviation [10]. Batch Normalization, for example, computes these statistics for each channel within a batch of examples, whereas Instance Normalization computes them within a single example. Second, a pair of parameters \(\gamma _{k}, \beta _{k}\) is applied to each channel k to scale and shift the normalized values:

$$\begin{aligned} y_{k} = \gamma _{k}x'_{k} + \beta _{k} \end{aligned}$$
(2)

The parameters \(\gamma _{k}, \beta _{k}\) are fitted in the course of training and enable the layer to represent the identity transform, if necessary. During inference, both parameters are fixed and independent of the input X. In this paper, we propose to apply instance-wise normalization and to design the parameters \(\gamma _{k}, \beta _{k}\) as functions of the input X, i.e.

$$\begin{aligned} \gamma&= f_{\gamma }(X) \end{aligned}$$
(3)
$$\begin{aligned} \beta&= f_{\beta }(X) \end{aligned}$$
(4)

where \(\gamma = (\gamma _{1}, \gamma _{2}, \dots , \gamma _{\mathrm {K}})\) and \(\beta = (\beta _{1}, \beta _{2}, \dots , \beta _{\mathrm {K}})\) are the scale and shift parameters for all channels, and \(\mathrm {K}\) is the number of channels. We represent the function \(f_{\gamma }\) with the original Squeeze-and-Excitation (SE) block with a sigmoid [11], whereas \(f_{\beta }\) is modeled with an SE block with a tanh activation function to enable a negative shift (see Fig. 1a). This new architectural unit, which we refer to as SE Normalization (SE Norm), is the major component of our model.
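To make these computations concrete, the following is a minimal PyTorch sketch of an SE Norm layer for 3D feature maps. The module name, the channel reduction ratio `r`, and the value of \(\epsilon \) are illustrative assumptions; only the structure (instance-wise normalization with SE-derived \(\gamma \) and \(\beta \)) follows the text.

```python
import torch
import torch.nn as nn


class SENorm(nn.Module):
    """SE Normalization sketch (Eqs. 1-4): instance-wise normalization
    with scale and shift predicted from the input by SE blocks."""

    def __init__(self, channels, r=2, eps=1e-5):
        super().__init__()
        self.eps = eps
        # f_gamma: SE block with a sigmoid, producing the scale (Eq. 3)
        self.f_gamma = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())
        # f_beta: SE block with a tanh, enabling a negative shift (Eq. 4)
        self.f_beta = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Tanh())

    def forward(self, x):
        # x: (batch, channels, depth, height, width)
        b, c = x.shape[:2]
        # Eq. 1: normalize with per-sample, per-channel statistics
        mu = x.mean(dim=(2, 3, 4), keepdim=True)
        var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
        x_norm = (x - mu) / torch.sqrt(var + self.eps)
        # Squeeze: global average pooling over the spatial dimensions
        z = x.mean(dim=(2, 3, 4))
        # Excite: input-dependent scale and shift, then apply Eq. 2
        gamma = self.f_gamma(z).view(b, c, 1, 1, 1)
        beta = self.f_beta(z).view(b, c, 1, 1, 1)
        return gamma * x_norm + beta
```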

Fig. 1. Proposed layers. Output dimensions are depicted in brackets.

2.2 Network Architecture

The widely used 3D U-Net [12, 13] serves as the basis of our model. Its basic element, a convolutional block comprised of a \(3\times 3\times 3\) convolution followed by a ReLU activation function and an SE Norm layer, is used to construct the decoder (Fig. 2, blue blocks). In the encoder, we utilize residual layers [14] consisting of convolutional blocks with shortcut connections (see Fig. 1b). If the numbers of input and output channels in a residual layer differ, we add a \(1\times 1\times 1\) convolutional block to the shortcut to perform a non-linear projection and match the dimensions (see Fig. 1c).
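A sketch of how these pieces might fit together is shown below; the two-block layout of the residual layer is an assumption, while the conv → ReLU → SE Norm order follows the text.

```python
import torch.nn as nn


class ConvBlock(nn.Module):
    """3x3x3 convolution followed by ReLU and SE Norm (SENorm as above)."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2)
        self.relu = nn.ReLU(inplace=True)
        self.norm = SENorm(out_ch)

    def forward(self, x):
        return self.norm(self.relu(self.conv(x)))


class ResLayer(nn.Module):
    """Residual layer: convolutional blocks with a shortcut connection;
    a 1x1x1 conv block projects the shortcut when channel counts differ."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(ConvBlock(in_ch, out_ch),
                                  ConvBlock(out_ch, out_ch))
        self.shortcut = (ConvBlock(in_ch, out_ch, kernel_size=1)
                         if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        return self.body(x) + self.shortcut(x)
```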

In the encoder, we perform downsampling using max pooling with a kernel size of \(2\times 2\times 2\). To upsample feature maps in the decoder, we use \(3\times 3\times 3\) transposed convolutions. In addition, we supplement the decoder with three upsampling paths that transfer low-resolution features further into the model: a \(1\times 1\times 1\) convolutional block reduces the number of channels, and trilinear interpolation increases the spatial size of the feature maps (Fig. 2, yellow blocks).
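One such upsampling path could be sketched as follows; the channel counts and scale factor depend on the decoder stage and are left as parameters.

```python
import torch.nn as nn
import torch.nn.functional as F


class UpsamplePath(nn.Module):
    """Decoder side path: a 1x1x1 conv block reduces channels, then
    trilinear interpolation increases the spatial size."""

    def __init__(self, in_ch, out_ch, scale):
        super().__init__()
        self.reduce = ConvBlock(in_ch, out_ch, kernel_size=1)
        self.scale = scale

    def forward(self, x):
        return F.interpolate(self.reduce(x), scale_factor=self.scale,
                             mode='trilinear', align_corners=False)
```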

The first residual layer, placed right after the input, uses a kernel size of \(7\times 7\times 7\) to increase the receptive field of the model without significant computational overhead. A softmax layer is applied at the output to produce probabilities for the four target classes.

To regularize the model, we add Spatial Dropout layers [15] right after the last residual block at each stage of the encoder and before the \(1\times 1\times 1\) convolution in the decoder tail (Fig. 2, red blocks).
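In PyTorch, Spatial Dropout on 3D feature maps corresponds to `nn.Dropout3d`, which zeroes entire channels rather than individual voxels; the dropout rate below is an illustrative assumption, as the text does not state it.

```python
import torch.nn as nn

# Spatial Dropout: nn.Dropout3d zeroes whole feature-map channels.
# The rate p = 0.5 is an assumption for illustration.
spatial_dropout = nn.Dropout3d(p=0.5)
```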

Fig. 2. Proposed network architecture with SE normalization. (Color figure online)

2.3 Data Preprocessing

Intensities of MRI scans are not standardized and typically exhibit high variability both within and across images. To decrease this intensity inhomogeneity, we perform Z-score normalization for each MRI sequence and each patient separately. The mean and standard deviation are calculated over the non-zero voxels corresponding to the brain region. All background voxels remain unchanged after the normalization.
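A minimal sketch of this per-sequence normalization, assuming zero-intensity voxels mark the background:

```python
import numpy as np


def zscore_normalize(volume):
    """Z-score normalization over non-zero (brain) voxels only;
    background voxels remain unchanged."""
    out = volume.astype(np.float32)
    brain = volume != 0
    mu, sigma = out[brain].mean(), out[brain].std()
    out[brain] = (out[brain] - mu) / sigma
    return out
```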

2.4 Training Procedure

Due to the large size of the provided MRI scans, we train on random patches of size \(144\times 160\times 192\) voxels (depth \(\times \) height \(\times \) width) on two NVIDIA GeForce GTX 1080 Ti GPUs (11 GB each) with a batch size of 2 (one sample per worker).

We train the model for 300 epochs using the Adam optimizer with exponential decay rates \(\beta _1 = 0.9\) and \(\beta _2 = 0.99\) for the moment estimates, and apply a cosine annealing schedule that gradually reduces the learning rate from \(lr_{max} = 10^{-4}\) to \(lr_{min} = 10^{-6}\) over 25 epochs, with the learning rate adjusted at each epoch.
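In PyTorch, this setup could be sketched as follows; `model` stands for the network defined above, `train_one_epoch` is a placeholder, and whether the schedule restarts after each 25-epoch cycle is an assumption.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.99))
# Cosine annealing from 1e-4 down to 1e-6 over 25 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=25, eta_min=1e-6)

for epoch in range(300):
    train_one_epoch(model, optimizer)  # placeholder training loop
    scheduler.step()                   # learning rate adjusted each epoch
```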

2.5 Loss Function

We utilize the unweighted sum of the Soft Dice Loss [16] and the Focal Loss [17] as the loss function during training. The Soft Dice Loss is a differentiable surrogate for the Dice score, which is one of the evaluation metrics used in the challenge. The Focal Loss, compared to the Soft Dice Loss, has a much smoother optimization surface that eases model training.

Based on [16], the Soft Dice Loss for one training example can be written as

$$\begin{aligned} L_{Dice}(y, \hat{y}) = 1 - \frac{1}{\mathrm {C}}\sum _{c=1}^{\mathrm {C}} \frac{2\sum _{i}^{\mathrm {N}} y_{i}^{c} \hat{y}_{i}^{c} + 1}{\sum _{i}^{\mathrm {N}} y_{i}^{c} + \sum _{i}^{\mathrm {N}} \hat{y}_{i}^{c} + 1} \end{aligned}$$
(5)

The Focal Loss is defined as

$$\begin{aligned} L_{Focal}(y, \hat{y}) = - \frac{1}{\mathrm {N}}\sum _{i}^{\mathrm {N}}\sum _{c=1}^{\mathrm {C}}y_{i}^{c}(1 - \hat{y}_{i}^{c})^{\gamma }\ln (\hat{y}_{i}^{c}) \end{aligned}$$
(6)

In both definitions, \(y_{i} = \big [ y_{i}^{1}, y_{i}^{2}, \dots , y_{i}^{\mathrm {C}} \big ]^{\top }\) is the one-hot encoded label for the i-th voxel, and \(\hat{y}_{i} = \big [ \hat{y}_{i}^{1}, \hat{y}_{i}^{2}, \dots , \hat{y}_{i}^{\mathrm {C}} \big ]^{\top }\) contains the predicted probabilities for the i-th voxel. \(\mathrm {N}\) and \(\mathrm {C}\) are the total numbers of voxels and classes for the given example, respectively. Additionally, we apply Laplacian smoothing by adding 1 to both the numerator and the denominator of the Soft Dice Loss to avoid division by zero when one or several labels are not represented in a training example. The parameter \(\gamma \) in the Focal Loss is set to 2.
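Both losses can be sketched for one example as follows, with `y_true` (one-hot labels) and `y_pred` (predicted probabilities) of shape (C, N), i.e. flattened over voxels; the `eps` guard in the logarithm is an implementation assumption.

```python
import torch


def soft_dice_loss(y_true, y_pred):
    """Eq. (5) with the +1 Laplacian smoothing in numerator and denominator."""
    intersection = (y_true * y_pred).sum(dim=1)
    denom = y_true.sum(dim=1) + y_pred.sum(dim=1)
    return 1 - ((2 * intersection + 1) / (denom + 1)).mean()


def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-8):
    """Eq. (6): focal-weighted cross-entropy, averaged over voxels."""
    loss = -y_true * (1 - y_pred) ** gamma * torch.log(y_pred + eps)
    return loss.sum(dim=0).mean()


def total_loss(y_true, y_pred):
    """Unweighted sum used for training."""
    return soft_dice_loss(y_true, y_pred) + focal_loss(y_true, y_pred)
```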

The training data in the challenge has labels for three tumor sub-regions, namely the necrotic and non-enhancing tumor core (NCR & NET), the peritumoral edema (ED) and the GD-enhancing tumor (ET). However, the evaluation is performed for the GD-enhancing tumor (ET), the tumor core (TC), which comprises NCR & NET along with ET, and the whole tumor (WT), which combines all provided sub-regions. Hence, during training, we optimize the loss directly on these nested tumor sub-regions.
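Assuming the usual BraTS label encoding (1 = NCR & NET, 2 = ED, 4 = ET), the nested targets can be built as:

```python
import numpy as np


def nested_targets(label):
    """Map BraTS labels to the nested evaluation regions WT, TC, ET."""
    et = label == 4
    tc = et | (label == 1)   # tumor core = ET + NCR & NET
    wt = tc | (label == 2)   # whole tumor = TC + peritumoral edema
    return np.stack([wt, tc, et]).astype(np.float32)
```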

2.6 Ensembling

To reduce the variance of the model predictions, we build an ensemble of models trained on different splits of the train set and use their average as the ensemble prediction. At each iteration, a model is trained on a 90%/10% split of the train set and subsequently evaluated on the online validation set. After repeating this procedure multiple times, we choose the 20 models with the highest performance on the online validation set and combine them into the ensemble. Predictions on the test set are produced by averaging the predictions of the individual models and applying a threshold of 0.5.
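A sketch of the averaging step, assuming each model outputs per-class probability maps (softmax applied inside):

```python
import torch


@torch.no_grad()
def ensemble_predict(models, x, threshold=0.5):
    """Average per-model probabilities, then binarize at the threshold."""
    probs = torch.stack([m(x) for m in models]).mean(dim=0)
    return (probs > threshold).float()
```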

2.7 Post-processing

The Dice score used for performance evaluation in the challenge is highly sensitive to cases in which the model predicts classes that are not present in the ground truth. For such cases, a false positive prediction for even a single voxel yields the lowest possible Dice score and might significantly affect the average model performance on the whole evaluation dataset. This primarily concerns patients without an ET sub-region. To address this issue, we add a post-processing step that removes small ET regions from the model output if their area is less than a certain threshold. We set this threshold to 32 voxels, the smallest ET area among all patients in the train set.
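A sketch of this step is given below; relabeling the removed ET voxels as NCR & NET is an assumption, since the text only states that small ET regions are removed.

```python
import numpy as np


def remove_small_et(seg, et_label=4, ncr_label=1, min_voxels=32):
    """Drop the ET prediction if its total area is below the threshold."""
    et_mask = seg == et_label
    if 0 < et_mask.sum() < min_voxels:
        seg = seg.copy()
        seg[et_mask] = ncr_label  # assumption: fold small ET into NCR & NET
    return seg
```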

3 Results and Discussion

The results of the BraTS 2020 segmentation challenge are presented in Tables 1 and 2. The Dice score, sensitivity and Hausdorff distance (HD) were used for the evaluation. The results in Table 1 were obtained on the online validation set of 125 patients, for which segmentation masks are not publicly available. The U-Net model was used as a baseline for comparison. The final results on the test set, consisting of 166 patients, are shown in Table 2.

Table 1. Performance on the online validation set (\(n = 125\)). Average results are provided for each evaluation metric.
Table 2. Performance on the test set (\(n = 166\)).

For all cases, the lowest average Dice score was obtained for the ET sub-region. This can be partially explained by the relatively small size of the ET class compared to the other tumor sub-regions, which makes segmentation of this class more challenging. The proposed model outperformed the U-Net baseline on all evaluation metrics except the Dice score for the ET class, mainly due to cases in which the ET sub-region was not present. Combining multiple models into an ensemble addressed this issue, since it reduced the chance of producing false positive predictions for the ET class and also led to better performance in terms of HD.