
1 Introduction

Qualitative and quantitative assessment of brain tumors is key to determining whether medical images can be used for clinical diagnosis and treatment, and researchers have therefore been exploring faster and more accurate methods for brain tumor segmentation. However, because the boundaries of the tumor subregions are fuzzy, fully automatic segmentation of brain tumors remains challenging.

The Brain Tumor Segmentation (BraTS) Challenge [1,2,3,4,17] has long focused on evaluating state-of-the-art methods for segmenting brain tumors in multimodal magnetic resonance imaging (MRI) scans. BraTS 2020 utilizes multi-institutional pre-operative MRI scans and primarily focuses on the segmentation of intrinsically heterogeneous brain tumors, namely gliomas. The BraTS 2020 dataset is annotated manually by one to four raters following the same annotation protocol, and the annotations are approved by experienced neuro-radiologists. Annotations comprise the background (label 0), the enhancing tumor (ET, label 4), the peritumoral edema (ED, label 2), and the necrotic and non-enhancing tumor (NCR/NET, label 1). Each patient's MRI scan consists of four modalities, i.e., native T1-weighted (T1), post-contrast T1-weighted (T1ce), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (T2-FLAIR).

Since the U-Net was first proposed by Ronneberger et al. in 2015 [18], U-Net and its variants have excelled in medical image segmentation. U-Net is a specialized convolutional neural network (CNN) with a down-sampling encoding path and an up-sampling decoding path, similar to an auto-encoder architecture. However, because U-Net takes two-dimensional data as input while medical images are usually three-dimensional, applying it slice by slice discards the spatial details of the original data, and the resulting segmentation accuracy is unsatisfactory. Alternatively, 3D U-Net [8] was proposed and has been widely used in medical image segmentation due to its outstanding performance. However, 3D U-Net is prone to overfitting and difficult to train because of its huge number of parameters, which greatly limits its application. Both 2D and 3D U-Net models thus have their own advantages and disadvantages, and a question naturally arises: is it possible to build a neural network that is as cheap to compute as a 2D network but performs as well as a 3D network?

Researchers have investigated this question for a long time, and numerous approaches have been proposed. Haquer et al. [9] proposed 2.5D U-Net, which consists of three 2D U-Nets trained on axial, coronal, and sagittal slices, respectively. Although it combines the lower computational cost of 2D U-Net with some of the effectiveness of 3D U-Net, it does not make full use of the spatial information in volumetric medical images. Chen et al. [6] proposed S3D U-Net, which replaces 3D convolutions with separable 3D convolutions. Although it segments medical images efficiently, its number of parameters is still large, which greatly limits its use in practical scenarios. Achieving both low computational cost and high performance is challenging; the key is how to model the spatial-temporal structure of the input volumetric data. Recently, spatial-temporal 3D networks have received more and more attention [15]. These networks perform 2D convolutions along three orthogonal views of volumetric video data to learn spatial appearance and temporal motion cues, respectively, and fuse the results to obtain the final output. Inspired by this, we propose the Multi-View Pointwise U-Net (MVP U-Net) for brain tumor segmentation. The proposed MVP U-Net follows the encoder-decoder-based 3D U-Net architecture, in which we use three multi-view convolutions and one pointwise convolution to reconstruct the 3D convolution.

Meanwhile, the Squeeze-and-Excitation (SE) block, proposed by Hu et al. [11] in 2017, can be incorporated as a subunit into existing state-of-the-art deep CNN architectures such as ResNet and DenseNet. It further improves the generalization ability of the original network by explicitly modeling the interdependencies between channels and adaptively recalibrating the channel-wise feature responses. In view of this, we incorporate this block into our MVP U-Net after appropriate modification.

2 Methods

2.1 Preprocessing

The images are preprocessed in three steps before being fed into the proposed MVP U-Net. First, each image is cropped to the region of nonzero values, and its intensity values are clipped to the [2.0, 98.0] percentiles of the entire image. Second, the brain region of each modality is normalized by Z-score normalization, and the region outside the brain is set to 0. Third, batch generators (a Python package maintained by the Division of Medical Image Computing at the German Cancer Research Center) are applied for data augmentation, including random elastic deformation, rotation, scaling, and mirroring [12].
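A minimal NumPy sketch of these steps for one modality is given below. Treating the percentile step as intensity clipping and the handling of the brain mask are our assumptions; the function name is illustrative.

```python
import numpy as np

def preprocess_modality(img):
    """Illustrative preprocessing of one MRI modality (a sketch of the steps above;
    the exact percentile handling used by the authors may differ)."""
    # 1) Crop to the bounding box of nonzero voxels.
    nz = np.nonzero(img)
    img = img[nz[0].min():nz[0].max() + 1,
              nz[1].min():nz[1].max() + 1,
              nz[2].min():nz[2].max() + 1]
    brain = img > 0                                   # brain mask, taken before clipping

    # Clip intensities to the [2.0, 98.0] percentiles of the entire image.
    p2, p98 = np.percentile(img, [2.0, 98.0])
    img = np.clip(img, p2, p98)

    # 2) Z-score normalize the brain region; keep the region outside the brain at 0.
    mean, std = img[brain].mean(), img[brain].std()
    out = np.zeros_like(img, dtype=np.float32)
    out[brain] = (img[brain] - mean) / (std + 1e-8)
    return out
```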

2.2 Network Architecture

MVP Convolution Block. The architecture of the proposed MVP convolution is shown in Fig. 1(a): a 3D convolution is decomposed into three 2D convolutions along the orthogonal views (axial, sagittal, coronal), applied in parallel and followed by a pointwise convolution. The pointwise convolution is one part of the depthwise separable convolution first proposed by Google [7], which consists of a depthwise convolution and a pointwise convolution. Figure 1(b) shows the traditional 3D convolution.

Fig. 1. Comparison of MVP convolution and 3D convolution.
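To make the decomposition concrete, the following PyTorch sketch expresses each view-wise 2D convolution as a 3D convolution whose kernel is flat along one axis. The module and parameter names, the assignment of view names to tensor axes, and the summation used to fuse the three views before the pointwise convolution are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MVPConv(nn.Module):
    """Multi-View Pointwise convolution: three 2D convolutions along the axial,
    sagittal, and coronal views of a 5D tensor (N, C, D, H, W), fused and followed
    by a 1x1x1 pointwise convolution over channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Each "2D" convolution is a 3D convolution with a flat kernel along one axis,
        # so it only mixes information within one anatomical view.
        self.axial    = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.coronal  = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 1, 3), padding=(1, 0, 1))
        self.sagittal = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 1), padding=(1, 1, 0))
        # Pointwise convolution merges the fused multi-view features across channels.
        self.pointwise = nn.Conv3d(out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        # Parallel multi-view branches; summation is one simple way to fuse them.
        fused = self.axial(x) + self.coronal(x) + self.sagittal(x)
        return self.pointwise(fused)
```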

Figure 2 shows our MVP convolution block, which includes instance normalization and a LeakyReLU activation (leakiness = 0.01). To alleviate the vanishing-gradient problem caused by increasing depth, we also add a residual connection to the original structure. The MVP convolution block is the main contribution of our method; each resolution level of the network comprises two MVP convolution blocks.

Fig. 2. The architecture of the proposed MVP convolution block.
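A corresponding sketch of the block, reusing the MVPConv module sketched above, is shown below. The ordering of normalization and activation, and the 1x1x1 projection on the shortcut when channel counts differ, are assumptions.

```python
import torch.nn as nn

class MVPConvBlock(nn.Module):
    """MVP convolution block: MVP convolution, instance normalization, and LeakyReLU,
    wrapped by a residual connection (a sketch under the assumptions stated above)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = MVPConv(in_ch, out_ch)           # defined in the previous sketch
        self.norm = nn.InstanceNorm3d(out_ch)
        self.act = nn.LeakyReLU(negative_slope=0.01, inplace=True)
        # 1x1x1 projection so the residual can be added when channel counts differ.
        self.skip = nn.Conv3d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)) + self.skip(x))
```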

MVP U-Net Architecture. The proposed MVP U-Net follows the encoder-decoder-based 3D U-Net architecture. Instead of traditional 3D convolutions [19], we employ multi-view convolutions to learn spatial-temporal features and pointwise convolutions to learn channel features. The multi-view convolution performs 2D convolutions along three orthogonal views of the input data, i.e., axial, sagittal, and coronal, and the pointwise convolution [10] merges the preceding outputs. In this way, the generalization ability of the model is improved while the number of parameters is reduced.

The sketch map of the proposed network is shown in Fig. 3. Like the original 3D U-Net [8], our network consists of three parts: 1) the contracting path on the left, which encodes increasingly abstract representations of the input and connects adjacent levels through an encoder module consisting of a \(3\times 3\times 3\) convolution with stride 2 and padding 1 instead of max pooling; 2) the expanding path on the right, which restores the original resolution; and 3) the skip connections, which concatenate encoder features with decoder features of the same resolution and feed the result into the next decoder submodule.

Fig. 3. The architecture of the proposed MVP U-Net.
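A minimal two-level sketch of this layout, reusing the MVPConvBlock sketched above, is given below. The channel widths, network depth, and use of a transposed convolution for upsampling are illustrative assumptions, not the paper's actual settings.

```python
import torch
import torch.nn as nn

class MVPUNet(nn.Module):
    """A minimal two-level encoder-decoder sketch of the MVP U-Net layout."""
    def __init__(self, in_ch=4, num_classes=4, base=16):
        super().__init__()
        self.enc1 = MVPConvBlock(in_ch, base)
        # Strided 3x3x3 convolution replaces max pooling between levels.
        self.down = nn.Conv3d(base, base * 2, kernel_size=3, stride=2, padding=1)
        self.enc2 = MVPConvBlock(base * 2, base * 2)
        self.up = nn.ConvTranspose3d(base * 2, base, kernel_size=2, stride=2)
        # The decoder block sees the upsampled features concatenated with the skip.
        self.dec1 = MVPConvBlock(base * 2, base)
        self.head = nn.Conv3d(base, num_classes, kernel_size=1)

    def forward(self, x):
        s1 = self.enc1(x)                                  # full resolution
        s2 = self.enc2(self.down(s1))                      # half resolution
        d1 = self.dec1(torch.cat([self.up(s2), s1], dim=1))  # skip connection
        return self.head(d1)
```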

MVP U-Net with the SE Block Architecture. The SE block consists of three operations: Squeeze, Excitation, and Reweight. It is a subunit structure that focuses on the feature channels. The Squeeze operation compresses each feature channel in the spatial dimension, transforming each feature map into a single real number; this number has a global receptive field to some extent, and the output dimension matches the number of input feature channels. The Excitation operation is a mechanism similar to the gates in recurrent neural networks: it generates a weight for each feature channel and is learned to explicitly model the correlation between feature channels. The Reweight operation treats the output of the Excitation operation as the importance of each feature channel and multiplies it onto the previous features channel by channel. After these steps, the recalibration of the original features along the channel dimension is complete.

However, the SE block was originally proposed to improve the classification of two-dimensional images. We modify it so that it can be applied to 3D feature maps and introduce it into our MVP U-Net after the concatenation step, as shown in Fig. 4. It assigns different weights to the channels of the concatenated feature map, enhancing the related features and suppressing the less related ones.

Fig. 4. The architecture of the proposed MVP U-Net with SE block.
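The following sketch shows one straightforward way to adapt the SE block to 3D feature maps as described above; the reduction ratio of 16 follows the original SE paper and is an assumption here, since the paper does not state it.

```python
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Squeeze-and-Excitation for 3D feature maps (a sketch): global average pooling
    (Squeeze), two fully connected layers with a sigmoid gate (Excitation), then
    channel-wise reweighting (Reweight)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                  # squeeze to (N, C, 1, 1, 1)
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c = x.shape[:2]
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1, 1)  # per-channel weights
        return x * w                                               # reweight the input
```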

2.3 Loss

The performance of a neural network depends not only on the choice of the network structure but also on the choice of the loss function, especially in the case of class imbalance. This holds for brain tumor segmentation, in which the classes vary greatly in size [5, 14]. In this paper, we employ a hybrid loss function that combines a multiclass Dice loss, used for multi-class segmentation, with a focal loss aimed at alleviating class imbalance. Our loss function can be expressed as follows,

$$\begin{aligned} L=L_{Dice} + L_{focal} \end{aligned}$$
(1)

The Dice loss is defined as,

$$\begin{aligned} L_{Dice}=\left( 1-\frac{2}{K} \sum _{k \in K} \frac{\sum _{i} u_{i}^{k} v_{i}^{k}}{ \sum _{i} u_{i}^{k}+\sum _{i} v_{i}^{k} }\right) \end{aligned}$$
(2)

where u is the softmax output of the network, v is the one-hot encoding of the corresponding ground-truth label, i indexes the voxels of the output map and the ground-truth label, k denotes the current class, and K is the total number of classes.

The focal loss [16] is defined as,

$$\begin{aligned} L_{focal}=\left\{ \begin{array}{ll} -\alpha (1-y^{\prime })^{\gamma } \log y^{\prime } &{} ,\quad y=1 \\ -(1-\alpha )y^{\prime \gamma } \log (1-y^{\prime }) &{} ,\quad y=0 \end{array} \right. \end{aligned}$$
(3)

where \(\alpha \) and \(\gamma \) are constants, set to 0.25 and 2 in our experiments, respectively. y is the ground-truth label of a voxel, and correspondingly, \(y^{\prime }\) is the predicted probability for that voxel in the output map.
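A sketch of the hybrid loss under these definitions is given below, with `probs` playing the role of u (and \(y^{\prime }\)) and `onehot` playing the role of v (and y); how the focal term is aggregated over classes and voxels is our assumption, since the paper does not spell it out.

```python
import torch

def dice_loss(probs, onehot, eps=1e-6):
    """Multiclass Dice loss of Eq. (2); probs and onehot have shape (N, K, D, H, W)."""
    dims = (0, 2, 3, 4)                                # sum over the batch and voxels i
    inter = (probs * onehot).sum(dim=dims)             # one value per class k
    denom = probs.sum(dim=dims) + onehot.sum(dim=dims)
    return 1.0 - (2.0 / probs.shape[1]) * (inter / (denom + eps)).sum()

def focal_loss(probs, onehot, alpha=0.25, gamma=2.0, eps=1e-6):
    """Focal loss of Eq. (3), applied voxel-wise per class and averaged."""
    pos = -alpha * (1.0 - probs) ** gamma * torch.log(probs + eps)          # y = 1
    neg = -(1.0 - alpha) * probs ** gamma * torch.log(1.0 - probs + eps)    # y = 0
    return torch.where(onehot > 0, pos, neg).mean()

def hybrid_loss(logits, onehot):
    """Total loss L = L_Dice + L_focal from Eq. (1)."""
    probs = torch.softmax(logits, dim=1)
    return dice_loss(probs, onehot) + focal_loss(probs, onehot)
```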

2.4 Optimization

We use the Adam optimizer to train our model [13]. The learning rate decreases as the epoch increases, which can be expressed as

$$\begin{aligned} lr=lr_{0} * (1-\frac{i}{N_{i}})^{0.9} \end{aligned}$$
(4)

where i is the current epoch and \(N_{i}\) is the total number of epochs. The initial learning rate \(lr_{0}\) is set to \(10^{-4}\).
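A minimal sketch of this schedule with PyTorch's Adam optimizer and LambdaLR is shown below; the stand-in model and the number of epochs are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

model = nn.Conv3d(4, 4, kernel_size=3, padding=1)     # stand-in for the MVP U-Net
lr0, n_epochs = 1e-4, 300                              # n_epochs is illustrative only
optimizer = torch.optim.Adam(model.parameters(), lr=lr0)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1.0 - epoch / n_epochs) ** 0.9)

for epoch in range(n_epochs):
    # ... run one training epoch over the BraTS patches ...
    scheduler.step()   # lr = lr0 * (1 - epoch / n_epochs) ** 0.9, as in Eq. (4)
```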

3 Experiments and Results

We use the data provided by the Brain Tumor Segmentation (BraTS) Challenge 2020 to evaluate the proposed network. The training dataset consists of 369 cases with ground-truth labels from expert board-certified neuroradiologists. Our model is trained on one GeForce GTX 1080Ti GPU in a PyTorch environment. The batch size is 1 and the patch size is set to \(160\times 192\times 160\). We concatenate the four modalities into a four-channel input feature map, where each channel represents one modality. The results of our MVP U-Net on the BraTS 2020 training dataset are shown in Table 1, and the BraTS 2020 Training 013 case of the training dataset with ground-truth and predicted labels is shown in Fig. 5.
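For illustration, one such four-channel input patch can be assembled as follows; the zero arrays stand in for the preprocessed modalities.

```python
import numpy as np

# Illustrative assembly of one network input: the four preprocessed modalities
# (T1, T1ce, T2, FLAIR) are stacked as channels of a single volume.
D, H, W = 160, 192, 160                        # the patch size used in training
t1, t1ce, t2, flair = (np.zeros((D, H, W), dtype=np.float32) for _ in range(4))
x = np.stack([t1, t1ce, t2, flair], axis=0)    # shape (4, D, H, W)
batch = x[np.newaxis]                          # shape (1, 4, D, H, W), batch size 1
```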

Fig. 5. The BraTS 2020 Training 013 case of the training dataset with ground-truth and predicted labels (yellow: NCR/NET, green: ED, red: ET). (Color figure online)

Table 1. Mean Dice, Hausdorff95, Sensitivity and Specificity on BraTS 2020 training dataset of the proposed method: original MVP U-Net. ET: enhancing tumor, WT: whole tumor, TC: tumor core.

The validation and testing datasets contain 125 and 166 cases, respectively, with unknown glioma grade. Their ground-truth segmentations are not released, and the evaluation is carried out via the online CBICA portal for the BraTS 2020 challenge. The models trained on the training dataset, namely the original 3D U-Net, the original MVP U-Net, and the MVP U-Net with SE block, are each used to predict on the BraTS 2020 validation dataset, and the quantitative evaluation is shown in Table 2. Compared with the original 3D U-Net, the original MVP U-Net and the MVP U-Net with SE block improve performance on most metrics, and the MVP U-Net with SE block segments better than the original MVP U-Net.

Table 2. Mean Dice, Hausdorff95, Sensitivity and Specificity on BraTS 2020 validation dataset of the proposed methods: original MVP U-Net and MVP U-Net with SE block. ET: enhancing tumor, WT: whole tumor, TC: tumor core.

Finally, we used the MVP U-Net with the SE block to predict on the testing dataset; the results are shown in Table 3. Our method achieves average Dice scores of 0.715, 0.839, and 0.768 for enhancing tumor, whole tumor, and tumor core, respectively. These results are similar to those on the validation dataset, indicating that the proposed model achieves satisfactory results in automatic multimodal brain tumor segmentation and generalizes relatively well.

Table 3. Dice, Hausdorff95, Sensitivity and Specificity on BraTS 2020 testing dataset of the proposed method: MVP U-Net with SE block. ET: enhancing tumor, WT: whole tumor, TC: tumor core.

4 Conclusion

In this paper, we propose a novel CNN-based network called Multi-View Pointwise (MVP) U-Net for brain tumor segmentation from multi-modal 3D MRI. We use three multi-view convolutions and one pointwise convolution to reconstruct the 3D convolution of the conventional 3D U-Net, where the multi-view convolutions learn spatial-temporal features and the pointwise convolution learns channel features. In this way, the proposed architecture not only improves the generalization ability of the network but also reduces the number of parameters. Further, we modify the SE block appropriately and introduce it into our original MVP U-Net after the concatenation step. Experiments show that this improves performance over the original MVP U-Net.

During the experiments, we tried a variety of approaches. We found that model performance could be improved by replacing the max-pooling downsampling of the U-shaped network with strided 3D convolutions, and that the results could also be improved by increasing the number of channels. Finally, the trained MVP U-Net with SE block was used to predict on the testing dataset and achieved mean Dice scores of 0.715, 0.839, and 0.768 for enhancing tumor, whole tumor, and tumor core, respectively. These results demonstrate the effectiveness of the proposed MVP U-Net with SE block for multi-modal brain tumor segmentation.

In the future, we will make further efforts in data preprocessing and network architecture design to alleviate the imbalance of tumor categories and improve the accuracy of tumor segmentation.