
1 Introduction

Gliomas are brain tumors originating from the glial cells, and can be divided into Low Grade Gliomas (LGG) and High Grade Gliomas (HGG). Although the former are less aggressive, the mortality rate of the latter is high [4, 19]. In fact, the most aggressive gliomas are called Glioblastoma Multiforme, with most patients not surviving more than fourteen months, on average, even under treatment [29]. The accurate segmentation of the tumor and its sub-regions is important not only for treatment and surgery planning, but also for follow-up evaluation [4, 19].

Over the years, several approaches were proposed for brain tumor segmentation [4, 19]. Some probabilistic methods explicitly model the underlying data [9, 12, 20, 23]. In these approaches, besides the model for the tissue intensities, it is possible to include priors on the neighborhood through Markov Random Field models [20], estimate a tumor atlas at segmentation time [9, 12, 20] and take advantage of biomechanical tumor growth models [9, 12]. Agn et al. [1] used a generative method based on Gaussian Mixture Models and probabilistic atlases, extended with a prior on the tumor shape learned by convolutional Restricted Boltzmann Machines.

Other approaches learn a model directly from the data in a supervised way [3, 14, 17, 22, 27, 30]. At their core, all of these supervised methods have classifiers that learn how to classify each individual voxel into a tissue type, which may result in isolated voxels, or small clusters, being misclassified inside another tissue; however, it is possible to regularize the segmentation by taking the neighborhood into account using Conditional Random Fields [3, 14, 17, 18]. Among the classifiers, Random Forests obtained some of the most promising results [17, 27, 30]. Bakas et al. [2] employed a hybrid generative-discriminative approach. The method is semi-automatic, requiring the user to select some seed points in the image. These points are used in a modified version of Glistr [9] to obtain a first segmentation, which is then refined with the gradient boosting algorithm. Lastly, a probabilistic refinement based on intensity statistics is used to obtain the final segmentation.

All the previous supervised methods require the computation of hand-crafted features, which may be difficult to design, or require specialized knowledge of the problem. On the other hand, Deep Learning methods automatically extract features [13]. In Convolutional Neural Networks (CNNs), a set of filters is optimized and convolved with the input image to compute certain characteristics; thus, CNNs can deal with the raw data directly. These filters represent the weights of the neural network. Since the filters are convolved over the features, the weights are shared across neural units in the resulting feature maps. In this way, the number of weights in these networks is lower than in neural networks composed only of fully-connected (FC) layers, making them less prone to overfitting [13]. Overfitting can be a severe problem in neural networks; Dropout addresses this by removing nodes of the network according to some probability at each training step, thus forcing all nodes to learn good features [25]. Some methods employing CNNs for brain tumor segmentation have already been proposed [8, 10, 15, 28]. Havaei et al. [10] used a complex architecture of parallel branches and two cascaded CNNs; training of the network was accomplished in two stages: first with balanced classes and then with a refinement of the last layer using a number of samples of each class closer to that observed in brain tumors. Lyksborg et al. [15] trained a CNN on each of the three orthogonal planes of the Magnetic Resonance Imaging (MRI) images, using them as an ensemble of networks for segmentation. Dvořák and Menze [8] used CNNs for structured predictions.

Inspired by Simonyan and Zisserman [24], we developed CNN architectures using small \(3 \times 3\) kernels. In this way, we can have more convolutional layers, with the opportunity to apply more non-linear transformations of the data. Additionally, we use data augmentation to increase the amount of training data and Leaky Rectifier Linear Units (LReLU) as non-linear activation function. This approach and architecture obtained the second position in the 2015 BraTS challenge.

2 Materials and Methods

The processing pipeline has three main stages: pre-processing, classification through CNNs and post-processing; Fig. 1 presents an overview of the proposed method and interactions between the Training and Testing stages.

Fig. 1. Overview of the processing pipeline. During training, we artificially augment the data, but at test time we use just the original version of the patches.

2.1 Data

BraTS 2015 [11, 19] includes two data sets: Training and Challenge. The Training data set comprises 220 acquisitions from patients with HGG and 54 from patients with LGG. Four MRI sequences are available for each patient: T1-, post-contrast T1- (T1c), T2- and FLAIR-weighted. In this data set, the manual segmentations are publicly available. In the Challenge data set, both the manual segmentations and the tumor grade are unknown. This set contains 53 subjects with the same MRI sequences as the Training set. All images were already rigidly aligned with the T1c and skull stripped; the resolution was guaranteed to be coherent among all MRI sequences and patients by interpolating the sequences with the thickest slices to 1 mm \(\times \) 1 mm \(\times \) 1 mm voxels.

2.2 Method

Given the differences between HGG and LGG, a model was trained for each grade. Thus, when segmenting a data set where the tumor grade is unknown, we require the user to visually inspect the images and identify the grade beforehand. After this procedure, the remaining pipeline is fully automatic, without requiring further intervention from the user, for example, to select parameters, seed points or regions of interest.

Pre-processing. The bias field in each MRI sequence was corrected using the N4ITK method [26]. This procedure was similar for all sequences, using 20, 20, 20 and 10 iterations, a shrink factor of 2 and a B-spline fitting distance of 200. After that, the intensities of each individual MRI sequence were normalized [21]. This normalization procedure learns a standardized histogram with a set of intensity landmarks from the Training set; then, the intensities between two landmarks are linearly transformed to fit the corresponding landmarks of the standardized histogram; we selected 12 matching landmarks in both LGG and HGG. Finally, patches are extracted from the axial slices and normalized to have zero mean and unit variance in each sequence; the mean and variance are calculated for each sequence using all training patches.
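As an illustration of the last step, a minimal sketch of the per-sequence patch standardization is given below; the array layout and function names are hypothetical, assuming patches are stored as (patches, sequences, height, width).

```python
import numpy as np

# Hypothetical sketch of the zero-mean, unit-variance normalization:
# `train_patches` is assumed to have shape (n_patches, 4, 33, 33),
# with axis 1 indexing the four MRI sequences.

def fit_sequence_stats(train_patches):
    # one mean/std per sequence, pooled over all training patches and voxels
    mean = train_patches.mean(axis=(0, 2, 3), keepdims=True)
    std = train_patches.std(axis=(0, 2, 3), keepdims=True)
    return mean, std

def standardize(patches, mean, std, eps=1e-8):
    # statistics estimated on the training patches are re-applied at test time
    return (patches - mean) / (std + eps)
```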

In brain tumor images the classes are highly imbalanced: there are many more samples of normal tissue than of tumor tissue; additionally, among the tumor classes some are more common than others, for example, edema represents a larger volume than necrosis, which may not even exist in some patients. To cope with this, around 40 % of our training samples are extracted from normal tissue, while the remaining 60 % correspond to brain tumor samples with approximately balanced numbers of samples across classes. However, since some classes are rare, the number of training samples of some tissues must be reduced to keep the classes balanced; so, during training each patch is rotated on the fly (in a parallel process) by 90°, 180° and 270° to artificially augment the training data; at test time the patches are not rotated and we classify just the central voxel.
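A minimal sketch of this rotation-based augmentation is shown below, assuming each multi-sequence patch is stored as a (4, 33, 33) array; the function name is hypothetical.

```python
import numpy as np

# Hypothetical sketch: each patch is rotated in-plane by 90, 180 and 270
# degrees during training, so one extracted patch yields four samples;
# at test time only the original (unrotated) patch is classified.

def augment_patch(patch, label):
    samples = [(patch, label)]
    for k in (1, 2, 3):  # k quarter-turns: 90, 180, 270 degrees
        samples.append((np.rot90(patch, k=k, axes=(1, 2)), label))
    return samples
```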

Convolutional Neural Network. In convolutional layers of CNNs the features are extracted by convolving a set of weights, organized as kernels, with the input. These weights are optimized during training to enhance different features of the images. The computation of the \(i^{th}\) feature map in layer l (\(F^{l}_{i}\)) is defined as

$$\begin{aligned} F^{l}_{i} = f\left( b^{l}_{i} + \sum _{j}{W^{l}_{i,j} * X^{l-1}_{j}}\right) \end{aligned}$$
(1)

where f denotes the activation function, b represents the bias, j indexes the input channel, W denotes the kernels and \(X^{l-1}\) the output of the previous layer.
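To make Eq. (1) concrete, a minimal sketch of the computation of one feature map is given below; the function and variable names are hypothetical, and a single feature map is computed for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

# Sketch of Eq. (1): the i-th feature map of layer l is the sum over input
# channels j of the 2-D convolution of X^{l-1}_j with the kernel W^l_{i,j},
# plus a bias b^l_i, passed through the activation function f.

def feature_map(X_prev, W_i, b_i, f):
    # X_prev: (channels, H, W); W_i: (channels, kh, kw); b_i: scalar
    conv_sum = sum(convolve2d(X_prev[j], W_i[j], mode='same')
                   for j in range(X_prev.shape[0]))
    return f(conv_sum + b_i)
```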

The architectures of the CNNs were developed following [24] and are described in Table 1; several variations were experimented with, but these were found to obtain the best results on the validation set. By using small kernels, we can stack more layers and have a deeper architecture, while maintaining the same effective receptive field as bigger kernels. For example, two layers with \(3\times 3\) filters have the same receptive field as one layer with \(5\times 5\) kernels, but we have fewer weights to train and we can apply two non-linear transformations to the data. We trained a deeper architecture for HGG than for LGG; adding more layers to the LGG architecture did not improve results, possibly because of the nature of LGG, such as its lower contrast in the core, when compared to HGG. The input consists of \(33\times 33\) axial patches from each of the 4 MRI sequences. Max-pooling downsamples the feature maps by keeping only the maximum inside a neighborhood of units; in this way, the computational load of the next layers decreases and small irrelevant details can be discarded. However, segmentation must also detect fine details in the image; thus, in our architectures, max-pooling is performed with some overlapping of the receptive fields, to keep important details for segmentation. In all the FC layers we use Dropout with \(p=0.5\) as regularization, in order to reduce overfitting. Besides preventing nodes from co-adapting to each other, Dropout works as an extreme case of bagging and ensemble of networks, since in each mini-batch different nodes are exposed to a small and different portion of the training data [25]. LReLU was the activation function in almost all layers, expressed as

$$\begin{aligned} f(x)=\max {(0,x)}+\alpha \min {(0,x)} \end{aligned}$$
(2)

where \(\alpha \) denotes the leakiness parameter, set to \(\alpha =\frac{1}{3}\). In contrast with ReLU, which imposes a constant 0 in the negative part of the function, LReLU has a small negative slope in that region. This is useful for training, since imposing a constant forces the back-propagated gradient to become 0 for negative values [16]. The loss function was defined as the Categorical Cross-entropy

$$\begin{aligned} H=-\sum _{j\in voxels}\sum _{k\in classes} c_{j,k} \log (\hat{c}_{j,k}) \end{aligned}$$
(3)

where \(\hat{c}\) denotes the probabilistic predictions (after the softmax activation function) and c denotes the target. Training is accomplished by optimizing the loss function through Stochastic Gradient Descent using Nesterov's momentum with a momentum coefficient of 0.9. The learning rate \(\epsilon \) was initialized as \(\epsilon = 0.003\) and linearly decreased after each epoch during the first 25 epochs, until \(\epsilon = 0.00003\). All convolutional layers operate over padded inputs to maintain the spatial size of their output.
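A minimal sketch of this learning-rate schedule is given below, assuming that "linearly decreased after each epoch during the first 25 epochs" means a linear interpolation between the two values, kept constant afterwards; the function name is hypothetical.

```python
# Hypothetical sketch of the learning-rate schedule: linear decay from
# 3e-3 to 3e-5 over the first 25 epochs, constant afterwards.

def learning_rate(epoch, lr_start=0.003, lr_end=0.00003, decay_epochs=25):
    t = min(epoch, decay_epochs) / float(decay_epochs)
    return lr_start + t * (lr_end - lr_start)

# learning_rate(0) -> 0.003, learning_rate(25) -> 0.00003
```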

Table 1. Architecture of the CNN for HGG (left) and LGG (right). All non-linearities were LReLU, with the exception of the last FC layer, where softmax was used.

The CNNs were implemented using Theano [5] and Lasagne [7].
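For illustration only, the building blocks described above (stacked \(3\times 3\) convolutions with LReLU, overlapping max-pooling, FC layers with Dropout and a softmax output over the five classes) could be assembled in Lasagne roughly as in the sketch below; the numbers of filters and of FC units are placeholders and do not reproduce the exact configuration of Table 1.

```python
import lasagne
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DenseLayer, DropoutLayer)
from lasagne.nonlinearities import LeakyRectify, softmax

lrelu = LeakyRectify(1.0 / 3.0)  # LReLU with leakiness alpha = 1/3

# 4 MRI sequences as input channels, 33x33 axial patches
net = InputLayer((None, 4, 33, 33))

# stacked 3x3 convolutions, padded so the spatial size is preserved
for n_filters in (64, 64, 64):          # placeholder filter counts
    net = Conv2DLayer(net, n_filters, (3, 3), pad='same', nonlinearity=lrelu)
# overlapping max-pooling: 3x3 window with stride 2
net = MaxPool2DLayer(net, pool_size=(3, 3), stride=(2, 2))

for n_filters in (128, 128, 128):       # placeholder filter counts
    net = Conv2DLayer(net, n_filters, (3, 3), pad='same', nonlinearity=lrelu)
net = MaxPool2DLayer(net, pool_size=(3, 3), stride=(2, 2))

# fully-connected layers with Dropout (p = 0.5) on their inputs
net = DenseLayer(DropoutLayer(net, p=0.5), 256, nonlinearity=lrelu)
net = DenseLayer(DropoutLayer(net, p=0.5), 256, nonlinearity=lrelu)
net = DenseLayer(DropoutLayer(net, p=0.5), 5, nonlinearity=softmax)

# training updates could then be built with, e.g.,
# lasagne.updates.nesterov_momentum(loss, lasagne.layers.get_all_params(
#     net, trainable=True), learning_rate, momentum=0.9)
```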

Post-processing. A morphological filter was applied to impose volumetric constraints. Then, the connected clusters are identified and we remove those with fewer than 10,000 voxels in HGG and 3,000 voxels in LGG.
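A minimal sketch of the cluster-removal step is given below (the preceding morphological filtering is omitted); the function name is hypothetical.

```python
import numpy as np
from scipy import ndimage

# Sketch: label the connected components of the binary tumor mask and keep
# only those with at least `min_size` voxels (10,000 for HGG, 3,000 for LGG).

def remove_small_clusters(tumor_mask, min_size):
    labels, _ = ndimage.label(tumor_mask)
    sizes = np.bincount(labels.ravel())
    keep = np.zeros(sizes.shape, dtype=bool)
    keep[1:] = sizes[1:] >= min_size  # label 0 is the background
    return keep[labels]
```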

2.3 Evaluation

Although we segment each image into five classes (normal tissue, necrosis, edema, non-enhancing tumor and enhancing tumor), the evaluation appraises three tumor regions: Enhancing tumor, Core (necrosis + non-enhancing tumor + enhancing tumor) and the Complete tumor (all tumor classes). To evaluate the segmentations, the following metrics were computed: Dice Similarity Coefficient (DSC), Positive Predictive Value (PPV), Sensitivity and Robust Hausdorff Distance. The DSC [6] measures the overlap between the manual and the automatic segmentation. It is defined as,

$$\begin{aligned} DSC = \frac{2TP}{FP + 2TP + FN}, \end{aligned}$$
(4)

where TP, FP and FN denote the numbers of true positive, false positive and false negative detections, respectively. PPV represents the proportion of detected positive results that are really positive and is defined as,

$$\begin{aligned} PPV = \frac{TP}{TP + FP}. \end{aligned}$$
(5)

Sensitivity measures the proportion of actual positives that are correctly identified as such and is useful to evaluate the numbers of true positive and false negative detections, being defined as

$$\begin{aligned} Sensitivity = \frac{TP}{TP + FN}. \end{aligned}$$
(6)
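For reference, a minimal sketch computing these three overlap metrics from binary masks of one tumor region is given below; the function name is hypothetical and the masks are assumed to be boolean arrays of equal shape.

```python
import numpy as np

# Sketch of Eqs. (4)-(6) for a single tumor region.

def overlap_metrics(pred, truth):
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    dsc = 2.0 * tp / (fp + 2.0 * tp + fn)
    ppv = tp / float(tp + fp)
    sensitivity = tp / float(tp + fn)
    return dsc, ppv, sensitivity
```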

The metrics provided by the organizers for the Challenge set were DSC and robust Hausdorff Distance. The Hausdorff Distance measures the distance between the surface of computed (\(\partial P\)) and manual (\(\partial T\)) segmentation, as

$$\begin{aligned} Haus(\partial P, \partial T) = \max \left\{ \sup _{p\in \partial P} \ \inf _{t\in \partial T} \ d\left( p, t \right) ,\ \sup _{t\in \partial T} \ \inf _{p\in \partial P} \ d\left( t, p \right) \right\} \end{aligned}$$
(7)

In the robust version of this measure, instead of taking the maximum distance between the surfaces of the computed and manual segmentations, the 95 % quantile of the distances is used.
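A rough sketch of this robust Hausdorff distance is given below, assuming the two surfaces are already available as point sets of voxel coordinates (how the surfaces are extracted is not detailed here); the function name is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

# Sketch of the robust (95% quantile) Hausdorff distance of Eq. (7):
# the directed surface-to-surface distances are summarized by their
# 95th percentile instead of the supremum.

def robust_hausdorff(surface_p, surface_t, q=95):
    # surface_p, surface_t: (n, 3) arrays of surface voxel coordinates
    d_p_to_t, _ = cKDTree(surface_t).query(surface_p)
    d_t_to_p, _ = cKDTree(surface_p).query(surface_t)
    return max(np.percentile(d_p_to_t, q), np.percentile(d_t_to_p, q))
```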

Fig. 2. Segmentation examples on the Training data set from (a) HGG and (b) LGG. From left to right: T1, T1c, FLAIR, T2, manual segmentation and obtained segmentation. Colors in the segmentations represent: blue - necrosis, green - edema, yellow - non-enhanced tumor, red - enhanced tumor (Color figure online).

3 Results and Discussion

Some segmentation examples obtained in the Training data set are illustrated in Fig. 2, where we can observe the necrosis, edema, non-enhanced and enhanced tumor classes; quantitative results on the same set are presented in Table 2 and Fig. 3. These results were obtained by 2-fold cross-validation in HGG and 3-fold cross-validation in LGG. Observing Table 2, the metrics in the Core and Enhanced regions of LGG are lower than in HGG, which may be due to the lower contrast of the former. In fact, the contrast in the Core region is lower in LGG [19] than in HGG. Additionally, although brain tumors are very heterogeneous, LGG tend to be smaller than HGG, with less Core tissue, as observed in the first and third rows of Fig. 2b. Another issue with LGG is the smaller number of training patients, when compared to HGG. From the boxplots in Fig. 3, we can observe a higher dispersion in the Core region of LGG compared to HGG; for the enhanced tumor in LGG the boxplots span almost the full scale of the metrics, possibly because some of these tumors do not possess enhancing tumor at all. However, the results for the Complete region are similar in LGG and HGG, with similar dispersion, as observed in the boxplots. There are some outliers in Fig. 3, mainly in HGG, which may be due to the high variability of brain tumors and to the larger number of patients with HGG. In line with the results in Table 2, in Fig. 2 the boundaries of the complete tumor seem well defined, both in LGG and HGG. However, the second and third rows of Fig. 2b suggest that we are over-segmenting the Core classes in LGG; nevertheless, the second example looks particularly difficult, with a large portion of tumor Core tissues in a very heterogeneous distribution, with sharp shapes and details.

Table 2. Results (mean) obtained with BraTS 2015 training data set.
Fig. 3. Boxplots of the results in each of the evaluated brain tumor regions using the Training data set in (a) HGG and (b) LGG; black dots represent outliers.

Fig. 4. Segmentation examples on the Challenge data set. From left to right: T1, T1c, FLAIR, T2 and obtained segmentation. Colors in the segmentations represent: blue - necrosis, green - edema, yellow - non-enhanced tumor, red - enhanced tumor (Color figure online).

Figure 4 presents segmentation examples obtained in the Challenge data set, while Table 3 and Fig. 5 present the quantitative results. In this case, all subjects of each grade in the Training data set were used for training the CNNs, with the exception of six validation patients per grade. To train the CNNs we extracted around 4,000,000 training patches for HGG and 1,800,000 for LGG, using mini-batches of 128 training samples; the effective number of training patches was four times larger due to the data augmentation. Observing Fig. 4, the segmentations seem coherent with the expected tumor tissues; for example, the enhanced tumor portions are delineated following the enhancing parts in T1c. Also, the complete tumor appears to be well delineated when compared with the FLAIR and T2 sequences, where the edema is hyperintense.

Table 3. Results (mean) using the challenge data set of BraTS 2015.
Fig. 5. Boxplots of DSC and Robust Hausdorff Distance obtained using the Challenge data set of BraTS 2015.

The training stage of each CNN took around one week. However, the complete processing pipeline takes approximately 8 min to segment each patient, using GPU processing on a computer with an Intel Core i7 3.5 GHz CPU, 32 GB of RAM and an Nvidia GeForce GTX 980 GPU, running Ubuntu 14.04.

4 Conclusions and Future Work

In this paper, we presented a CNN to segment brain tumors in MRI. Apart from the need for the user to identify the tumor grade, all steps in the processing pipeline are automatic. Although simple, this architecture shows promising results, with room for further developments, especially in the Core region and in the segmentation of LGG; in the Challenge data set the proposed method was ranked in second position. As future work, we want to make the method totally grade independent, possibly through joint LGG/HGG training or an automatic grade identification procedure before segmentation.