
1 Introduction

In the UK, the National Health Service (NHS) [16] reports that over 11,000 patients are diagnosed each year with primary brain tumours, i.e. tumours originating in the brain, and about half of these tumours are cancerous. Beyond that, many others are diagnosed with secondary brain tumours, whose cells have broken away from a primary tumour in another part of the body and spread to the brain. Glioblastoma (GBM) is the most common and aggressive primary brain tumour in adults [14]. GBM belongs to the gliomas, tumours that start in the glial cells and can be categorised into low-grade gliomas (LGG), which are benign and slow growing, and high-grade gliomas (HGG), which are malignant, fast growing and more aggressive with a worse prognosis [4, 25]. To improve patient survival, brain tumour segmentation is an important task during diagnosis, treatment planning and treatment itself: the location and size of the tumour need to be determined, the tumour boundary must be outlined precisely to expose as little healthy tissue as possible to radiation, and the treatment progress has to be evaluated and monitored [2, 26].

To detect gliomas, experts can use magnetic resonance imaging (MRI) to create images of the brain [5]. The generated images are usually 2D slices; stacking these slices produces a 3D volume of the brain. Varying the acquisition settings produces different MRI sequences with different image appearance, as shown in Fig. 3. Using multiple sequences for segmenting brain tumours leads to better results, since the modalities provide complementary information about the different glioma sub-regions [13].

Compared to manual segmentation, automatic brain tumour segmentation has shown the potential to deliver more accurate and reliable results to aid precise treatment of brain tumours. The task is to segment brain MRI scans so that potential tumours are highlighted, with the aim of determining the outline of tumour structures as accurately as possible. Various machine learning approaches to medical image segmentation have emerged over the last decades [12]. Significant progress has been achieved in recent years with the advent of convolutional neural networks (CNNs), which have outperformed previous state-of-the-art methods [10, 12, 22].

A large number of end-to-end trained 2D and 3D CNN architectures have been proposed to tackle the task of medical image segmentation. According to Litjens et al. [12], most brain images are 3D volumes. Using a 2D network on 3D image volumes can therefore be suboptimal, since image information along the third axis remains unused; this observation served as an inspiration for our work. In [21], the 2D U-Net, which processes 2D slices for biomedical image analysis, was proposed. Based on this, the 3D U-Net in [3] is capable of processing 3D image volumes by replacing all 2D operations with their 3D counterparts. In [8], DeepMedic, a dual-pathway 3D CNN combined with a fully connected 3D conditional random field (CRF) as a post-processing step, was proposed for automatic brain lesion segmentation. By processing 3D MRI volumes, it makes better use of 3D contextual information than the methods of [5, 17, 27], which process 2D slices. In [7], the DeepMedic model of [8] was extended with residual connections to achieve better segmentation results for all classes of the segmentation task. V-Net [15] is also a fully convolutional neural network for volumetric image segmentation and can process 3D MRI volumes. In [11], HighRes3dNet was proposed for the segmentation of fine structures in volumetric images. The network maintains high spatial resolution feature maps throughout its layers, in contrast to the downsample-upsample approach [3, 8, 15], where low-level features with high spatial resolution are first downsampled and the resulting feature maps are then upsampled to obtain a high-resolution segmentation. The key ideas of the architecture are dilated convolutions, which produce accurate dense predictions and detailed segmentation maps along object boundaries, and residual connections, which make information propagation smooth and improve training speed. In [23], a cascade of three 2D CNNs (WNet, TNet, and ENet) was applied to three orthogonal views (axial, coronal, and sagittal) to segment brain tumour sub-regions hierarchically and sequentially from 3D MRI volumes. The segmentation outputs of the three orthogonal views are fused for a more robust segmentation performance. Instead of using the WNet model merely to predict the whole tumour as in [23], a WNet was trained in [24] for multi-class segmentation to predict all tumour sub-regions. As a CNN enhancement, the autofocus layer, which adaptively changes the size of the effective receptive field, was introduced for semantic image segmentation with CNNs [18]. Models such as [7] were augmented with autofocus layers for the task of brain tumour segmentation and demonstrated promising segmentation results.

In this work we present our end-to-end trained approach to brain tumour segmentation, which leverages the capability of CNNs with autofocus layers to process 3D MRI volumes. The network is based on the WNet architecture, which consists of anisotropic and dilated convolutions as well as up- and down-sampling to process stacks of 2D slices [23]. Since most networks exploit the 3D contextual information of 3D MRI volumes, we also propose to process them with 3D operations by adapting the WNet accordingly; in particular, all 2D operations were replaced with their 3D counterparts. Moreover, our proposed Autofocus Net comprises autofocus layers, motivated by their potential to generate more powerful features by adaptively changing the size of the effective receptive field. This was done by converting certain standard convolutional layers into autofocus layers. We provide a comparison (see Sect. 4) with models without autofocus layers evaluated on the same data. Compared with the original 2D WNet model of Wang et al., our resulting 3D models do not match their Dice score: for whole tumour segmentation, their binary 2D WNet achieves a Dice score of 89.97, whereas our proposed 3D models achieve 79.78 and 83.92 with autofocus layer settings {1, 2, 3} and {2, 4, 8, 12}, respectively.

2 Autofocus Net

In Fig. 1, we provide a schematic representation of our CNN. Our network is made up of a modified version of WNet [23] and autofocus (AF) layers [18].

Fig. 1. Schematic representation of our CNN architecture: the network with autofocus layers and dilation rates {2, 4, 8, 12}. The last six blocks are autofocus blocks, each consisting of two autofocus layers and a residual connection, analogous to the default residual blocks. The numbers (2\(\times \) and 4\(\times \)) denote twofold and fourfold up-sampling, respectively.

Following the example of the 3D U-Net, which is an adapted version of the originally proposed 2D U-Net as mentioned in the introduction, we first adapted the 2D WNet to create a 3D WNet. Furthermore, since the convolutions in an autofocus layer [18] use 3D kernels of size 3 \(\times \) 3 \(\times \) 3, this adaptation was a necessary step to be able to experiment with autofocus layers. All decomposed kernels of size 3 \(\times \) 3 \(\times \) 1 (intra-slice) and 1 \(\times \) 1 \(\times \) 3 (inter-slice) in WNet were changed into 3D kernels of size 3 \(\times \) 3 \(\times \) 3 and \(1\times 1\times 1\), respectively. Originally, WNet uses 10 residual blocks, each consisting of two convolutional layers with a residual connection bypassing the parameterised layers. Instead, only the first four residual blocks were kept and the last six residual blocks were converted into AF blocks, each consisting of two autofocus layers with a residual connection.
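To illustrate the structure of the adapted blocks, the following is a minimal PyTorch sketch of a 3D residual block as described above (two 3 \(\times \) 3 \(\times \) 3 convolutions bypassed by a residual connection); the class name, channel count and exact layer ordering are our own assumptions, and the actual model was implemented in NiftyNet.

```python
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Hypothetical sketch of an adapted 3D residual block: two 3x3x3
    convolutions (Batch Norm -> Activation -> Convolution ordering, cf.
    Fig. 2) with a residual connection bypassing the parameterised layers."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection
```

In the AF blocks, the two standard 3 \(\times \) 3 \(\times \) 3 convolutions of such a block are replaced by two autofocus layers (see Sect. 2.2), while the bypassing connection is kept.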

Two Autofocus Net models were trained: one with dilation rates {1, 2, 3} and one with dilation rates {2, 4, 8, 12}. Both were trained for the binary and the multi-class case with all hyper-parameters adopted from the original configuration in [23], except for a modified batch size due to hardware limitations in our setup. The models were trained in axial, coronal and sagittal view and the results were fused (see Sect. 3.3) for evaluation. The Adaptive Moment Estimation (Adam) [9] optimiser was used for training, with initial learning rate 1e-3, weight decay 1e-7, L2 regularisation and batch size 1. The size of the training image windows sampled from the image volumes was 96 \(\times \) 96 \(\times \) 96 for the modalities (T1, T2, T1c, FLAIR) and the labels (Seg). Because of the class-imbalanced data, the Dice loss function [15] was used for training the binary Autofocus Net. For the multi-class case, the combination of Dice and cross-entropy loss used in [6] for highly unbalanced segmentations was employed.
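As an illustration of the losses mentioned above, the following Python sketch shows a generic soft Dice loss and its combination with cross-entropy for the multi-class case. It uses PyTorch for brevity, the function names are our own, and it is not the exact formulation of [6, 15] or of the NiftyNet implementation.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(probs, target, eps=1e-5):
    """Soft Dice loss over a batch of probability maps.
    probs, target: tensors of shape (N, C, D, H, W); target is one-hot."""
    dims = (0, 2, 3, 4)                       # sum over batch and spatial dims
    intersection = torch.sum(probs * target, dims)
    denominator = torch.sum(probs, dims) + torch.sum(target, dims)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice.mean()                  # average over classes

def dice_ce_loss(logits, target_onehot):
    """Combined Dice + cross-entropy loss, a generic formulation for
    highly unbalanced multi-class segmentation."""
    probs = F.softmax(logits, dim=1)
    ce = F.cross_entropy(logits, target_onehot.argmax(dim=1))
    return soft_dice_loss(probs, target_onehot) + ce
```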

2.1 WNet

The WNet, the first of the three cascaded networks from [23], has been implemented in NiftyNet [19]. In [23], the WNet is used only for binary segmentation of the whole tumour (WT) in the first step, before the segmentation result is passed on to the second network in the cascade for further processing. However, in [24] a single WNet was trained for multi-class segmentation to segment all tumour sub-regions at once. This indicates that the WNet can be trained for both binary and multi-class segmentation (Fig. 2).

Fig. 2. WNet: convolutional layers (blue) have a different ordering of the components (Batch Norm, Activation and Convolution) compared to the architecture in [23]. Numbers (2\(\times \) and 4\(\times \)) denote twofold and fourfold 2D up-sampling, respectively (adapted from [23]). (Color figure online)

The architecture consists of 10 residual blocks with anisotropic convolutions, dilated convolutions, and multi-scale prediction to improve segmentation performance. Each of the 20 convolutional layers within the residual blocks is an intra-slice layer with kernel size 3 \(\times \) 3 \(\times \) 1. The four inter-slice convolutional layers have a kernel size of 1 \(\times \) 1 \(\times \) 3. This decomposition of a 3D kernel (3 \(\times \) 3 \(\times \) 3) into an intra-slice and an inter-slice kernel accounts for anisotropic receptive fields. The 2D up-sampling layers ensure that all tensor dimensions match when the concatenation operation is applied.
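The decomposition can be sketched as follows; this is a hypothetical PyTorch illustration of the kernel shapes only (channel count and axis ordering are our own assumptions), not the NiftyNet implementation of WNet.

```python
import torch.nn as nn

# Hypothetical sketch of WNet's anisotropic kernel decomposition.
# The assignment of axes to width/height/depth is purely illustrative.
intra_slice = nn.Conv3d(32, 32, kernel_size=(3, 3, 1), padding=(1, 1, 0))
inter_slice = nn.Conv3d(32, 32, kernel_size=(1, 1, 3), padding=(0, 0, 1))

def decomposed_conv(x):
    """Covers a 3x3x3 neighbourhood with two anisotropic convolutions."""
    return inter_slice(intra_slice(x))
```

Stacking the intra-slice and inter-slice convolutions covers a 3 \(\times \) 3 \(\times \) 3 neighbourhood with fewer parameters than a full 3D kernel (9 + 3 vs. 27 weights per input-output channel pair).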

NiftyNet provides a pre-trained WNet model trained on the BraTS 2017 dataset, which could have served as a sensible starting point for transfer learning. However, it was not possible to ensure that none of our test data was contained in its training data, which could bias the results. Therefore, we retrained the WNet with our own training data from the BraTS 2018 dataset for binary and multi-class segmentation as a baseline for comparison with our Autofocus Net. In both cases, the WNet was trained in axial, coronal and sagittal view, and the 3D results were fused as described in Sect. 3.3 to obtain an average segmentation result, which reduces false positives. For completeness, we also evaluated the pre-trained model on our test data, keeping in mind that its results might be biased, as some of our test cases could have been part of its training data due to overlapping HGG and LGG volumes in the BraTS 2017 and BraTS 2018 datasets.

2.2 Autofocus Layer

The autofocus layer can be integrated into existing network architectures by replacing standard or dilated convolutions [18]. These layers adaptively change the size of the effective receptive field to generate more powerful features. In particular, autofocus layers enhance the multi-scale processing of CNNs by adaptively choosing the optimal scale for identifying different objects in an image. They also enhance the interpretability of a CNN, as the computed attention maps, which are used as filters, reveal how the focus on different scales is realised.

An autofocus layer, as described in [18], consists of an attention model and K (e.g. K = 4) parallel dilated convolutions, each with a different dilation rate \(r_k\), to capture multi-scale information. Given the activations \(F^{l-1}\) of the previous layer, where l denotes the layer number, each dilated convolution processes the same input \(F^{l-1}\). Therefore, it is necessary to create K convolutions, each with a different dilation rate. Note that all parallel convolutions must share the same weights to be adaptively scale invariant. However, NiftyNet does not offer a way to share weights between multiple convolutions.

Our approach was to use the same convolutional layer multiple times, each time with a different input. This simulates the desired parallelism and solves the weight-sharing issue. However, it does not by itself provide different dilation rates in each parallel branch, since there is only a single shared convolutional layer with one dilation rate at a time. Hence, instead of creating K parallel convolutions with different dilation rates, K dilated input tensors, each corresponding to a different dilation rate, were created. Feeding those K tensors one after another into a single convolution (dilation rate 1) simulates the desired parallelism and has the same effect.
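For reference, the following PyTorch sketch shows the standard formulation of an autofocus layer with explicit weight sharing across K dilation rates, which is the effect our NiftyNet workaround reproduces; the class name, attention-model design and channel sizes are our own assumptions and differ from the actual NiftyNet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutofocusConv3d(nn.Module):
    """Sketch of an autofocus layer [18]: K parallel dilated convolutions
    sharing one weight tensor, blended by a small attention model."""
    def __init__(self, in_ch, out_ch, dilations=(2, 4, 8, 12)):
        super().__init__()
        self.dilations = dilations
        # One shared weight tensor, reused for every dilation rate.
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, 3, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        # Attention model: predicts a softmax over the K scales per voxel.
        self.attention = nn.Sequential(
            nn.Conv3d(in_ch, in_ch // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(in_ch // 2, len(dilations), kernel_size=1),
        )

    def forward(self, x):
        att = F.softmax(self.attention(x), dim=1)       # (N, K, D, H, W)
        out = 0
        for k, r in enumerate(self.dilations):
            # Same weights, different dilation rate; padding keeps the size.
            y = F.conv3d(x, self.weight, padding=r, dilation=r)
            out = out + att[:, k:k + 1] * y
        return out
```

In our NiftyNet-based implementation, the same behaviour is obtained by passing the K differently dilated input tensors through a single dilation-1 convolution, as described above.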

3 Implementation Details

3.1 Data

We used MRI scans from 285 glioma patients from the BraTS 2018 Challenge [1, 14, 20] dataset. The training set consists of 210 HGG and 75 LGG cases. Each case includes 4 modalities (T1, T1c, T2, and FLAIR) that were co-registered and resampled to 1 mm\(^3\) isotropic resolution. In addition, segmentation files with voxel-wise ground truths (GT) annotating the complete tumour are provided in the training set. The validation set consists of 67 cases, each with the same 4 modalities as in the training set. The GT segmentations of the validation cases are hidden and not provided with the dataset, which is why a random 70/20/10 (in %) split was generated from the training set, yielding 199 training (145 HGG, 54 LGG), 57 validation (44 HGG, 13 LGG), and 29 inference (21 HGG, 8 LGG) cases, respectively. The matrix size for each patient is 240 (slice width) \(\times \) 240 (slice height) \(\times \) 155 (number of slices) \(\times \) 4 (modalities).
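A minimal Python sketch of how such a random 70/20/10 split can be generated is shown below; the function name, seed and ID handling are our own assumptions and not part of the original pipeline.

```python
import random

def split_cases(case_ids, fractions=(0.7, 0.2, 0.1), seed=0):
    """Randomly split the BraTS 2018 training cases into training,
    validation and inference subsets (approximately 70/20/10 percent)."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)          # deterministic shuffle
    n_train = int(fractions[0] * len(ids))    # ~70% for training
    n_val = int(fractions[1] * len(ids))      # ~20% for validation
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "inference": ids[n_train + n_val:],   # remaining ~10% for inference
    }
```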

Fig. 3. MRI modalities. From left to right: T1, T1c, T2, FLAIR, and the ground truth segmentation. Tumour sub-regions: green–edema, yellow–enhancing tumour, and red–necrotic and non-enhancing tumour core. (Color figure online)

3.2 Training

All images processed by the network have varying volume sizes, because each image is cropped with a bounding box to remove unnecessary background, i.e. all voxels with intensity value zero. For example, a volume with dimensions 240 \(\times \) 240 \(\times \) 155 might be reduced to 134 \(\times \) 167 \(\times \) 135. This has no effect other than reducing the image volume sizes across all images. As minimal pre-processing, the brain-tissue intensities of each sequence were normalised to zero mean and unit variance. All networks were trained without data augmentation to ensure better comparability between network architectures. All models were trained for 30,000 iterations and saved every 5,000 iterations, resulting in 6 model checkpoints. The checkpoint with the best Dice score was chosen in each case and is presented here. Training and evaluation were performed on a standard workstation with an NVIDIA GeForce GTX 1060 6 GB GPU. We used Python with the NiftyNet [19] library to implement the proposed Autofocus Net and the autofocus layer.
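The cropping and normalisation steps can be sketched in NumPy as follows; the function names are our own, and this illustrates the procedure described above rather than the NiftyNet pre-processing code.

```python
import numpy as np

def crop_to_brain(volume):
    """Crop a 3D volume to the bounding box of non-zero (brain) voxels."""
    nonzero = np.argwhere(volume != 0)
    lo, hi = nonzero.min(axis=0), nonzero.max(axis=0) + 1
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

def normalise(volume):
    """Normalise brain-tissue intensities to zero mean and unit variance,
    leaving the zero-valued background untouched."""
    brain = volume != 0
    out = volume.astype(np.float32).copy()
    out[brain] = (out[brain] - out[brain].mean()) / (out[brain].std() + 1e-8)
    return out
```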

3.3 Testing

To evaluate model performance, inferences were made on the 29 inference cases generated in the data split. A previously unseen MRI volume is segmented by passing sampled image windows through the network. The softmax output of the final classification layer is a segmentation map in which each voxel is assigned a probability between 0 and 1 of belonging to the corresponding tumour sub-region. These probabilities were used to calculate the reported measurements.

The respective 3D predictions for the axial, coronal and sagittal view were post-processed. As in [23], the three individual predictions were fused by averaging the three softmax outputs of all views. The output is an array of size H \(\times \) W \(\times \) D \(\times \) 1 \(\times \) C with one component per class: two components (background and WT) in the binary case and four components (background, whole tumour (WT), enhancing tumour (ET) and tumour core (TC)) in the multi-class case. When calculating the Dice score per channel (WT, ET and TC), all voxels with probability >0.5 are considered as belonging to the foreground class of the current channel; every other voxel is treated as background and not part of the class.
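The fusion and evaluation procedure can be sketched in NumPy as follows; the function names are ours, the ground-truth masks are assumed to be given as boolean arrays per sub-region, and the actual evaluation was performed within our NiftyNet-based pipeline.

```python
import numpy as np

def fuse_views(axial, coronal, sagittal):
    """Average the softmax outputs of the three views. Each input has shape
    (H, W, D, 1, C) after mapping back to a common orientation."""
    return (axial + coronal + sagittal) / 3.0

def dice_per_channel(fused, gt_mask, channel, threshold=0.5):
    """Binarise one foreground channel of the fused softmax output at the
    given threshold and compute the Dice overlap with a boolean
    ground-truth mask of that tumour sub-region."""
    pred = fused[..., 0, channel] > threshold
    intersection = np.logical_and(pred, gt_mask).sum()
    return 2.0 * intersection / (pred.sum() + gt_mask.sum() + 1e-8)
```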

4 Results

For simplicity of visualisation, only slices of the softmax segmentation results for Autofocus Net are shown. Table 1 shows a quantitative comparison of the best mean Dice scores with standard deviation (Std) across all trained binary and multi-class models. For the multi-class case, some models have a higher mean Dice for WT but a lower mean Dice for TC and ET compared to others. Therefore, the multi-class models with the highest sum of mean Dice scores (WT + TC + ET) were chosen.

Table 1. Quantitative comparison between the proposed approach and related networks on the BraTS 2018 dataset.
Fig. 4. Binary result: softmax output, which assigns a probability to each voxel. (a) FLAIR. (b) GT - ground truth. (c) Predicted binary segmentation. (d) Fused prediction. (Color figure online)

Fig. 5. Multi-class result: softmax output, which assigns a probability to each voxel. (a) FLAIR. (b) Predicted whole tumour (WT). (c) Predicted enhancing tumour (ET). (d) Predicted tumour core (TC). (Color figure online)

Qualitative results from one inference case using our Autofocus Net for binary and multi-class segmentation are shown in Fig. 4 and Fig. 5, respectively. The green colour in Fig. 4 (b) shows the segmentation of the whole tumour and represents the edema (label 2). The segmentation images in Fig. 4 (c) are taken from the corresponding model trained in axial, sagittal or coronal view, respectively. The fused result, combining the three 3D segmentation results of the individual models, is shown in Fig. 4 (d). The fused result per tumour sub-region and view is shown in Fig. 5. In both figures, each voxel is assigned a probability between 0 (blue) and 1 (red) of belonging to the corresponding tumour sub-region. The iterations that achieved the best Dice scores are 2D WNet (20k/20k), 3D WNet (30k/30k), Autofocus Net {1, 2, 3} (20k/25k), and Autofocus Net {2, 4, 8, 12} (20k/30k) for the (binary/multi-class) models, respectively.

5 Discussion

Our initial assumption that replacing standard convolutional layers with autofocus layers can boost segmentation performance was not confirmed by the results. In both the binary and the multi-class case, the modified networks achieve similar or worse Dice scores compared to the corresponding models without autofocus layers. However, the networks using autofocus layers with four parallel convolutions perform better than those with three parallel convolutions. This suggests that an increased number of parallel convolutions enables the autofocus layer to handle the variability of brain tumour sizes more effectively. One drawback of the Autofocus Net is the increased number of (parallel) convolutional layers, and therefore the higher number of parameters, inside the autofocus layers, which requires longer training compared with the non-autofocus counterpart. Training the Autofocus Net with four parallel convolutions inside the autofocus layers took about 13.6% and 12.1% more time for the binary and multi-class model, respectively: in our tests, the binary model took 117,179 s with autofocus layers and 103,142 s without, and the multi-class model took 124,837 s and 111,383 s with and without autofocus layers, respectively. However, we do not consider this a major issue compared with other networks of similar structural complexity.

The results show that our proposed approach achieved competitive performance for automatic brain tumour segmentation in the binary case. Our multi-class results differ considerably (by up to 20%) from the results in [24], despite a similar network structure and training hyper-parameters. We believe there are two major reasons for this. First, we used a random data split of the training set and thus had 86 MRI volumes (approx. 30%) fewer for training. Second, the training hyper-parameters adopted from simpler network structures, in particular the number of training iterations, might be suboptimal for our model's complexity. Given more time, one could investigate the appropriate number of autofocus layers and their settings (i.e. the number of parallel convolutions and dilation rates) together with the best training settings, in order to identify the model that achieves the best segmentation results. Additionally, NiftyNet offers several degrees of freedom regarding pre-processing (such as normalisation) and training (such as data augmentation, optimiser settings or the loss function). Choices in both steps are not independent of each other; they strongly affect the reproducibility of model performance reported in the literature and thus the accuracy of the segmentation result, which could be improved further in the future.

6 Conclusion

In conclusion, we presented an approach based on a 3D CNN with autofocus layers that performs binary and multi-class segmentation of 3D brain MR images. We developed the first publicly available NiftyNet-based implementation of the autofocus convolutional layer [18], which is used in the proposed network. The approach in this paper builds on the methods proposed in [18, 23, 24]. We adapted the WNet from 2D to 3D and additionally replaced the last standard convolutional layers with autofocus layers using different autofocus settings. We evaluated the proposed method with different settings on the BraTS 2018 dataset to investigate the impact on segmentation performance and compared the results with related architectures. The results show that our best models achieved results at least similar to those of the models proposed in the literature, with an average Dice score of 83.92 for the binary case using the 3D Autofocus Net. Due to compute and time constraints, we were unable to search the whole hyper-parameter space of the number of parallel convolutions and dilation rates within each layer. However, increasing the number of parallel convolutions in the autofocus layers appears to improve the models, so further investigation into optimising this setting is required.