1 Introduction

A tumour is an abnormal growth of cells in a certain region of the body. By this definition, brain tumours are abnormal growths situated in the brain or central nervous system (CNS). The World Health Organization (WHO) classifies tumours [1] according to a grading system ranging from Grade I to Grade IV in increasing order of proliferative potential, which indicates the potential rate and activity at which the cells multiply. Gliomas, brain tumours that arise from glial cells, the supporting cells of the brain and spinal cord [2], can be separated into two main grades depending on their proliferative potential: low-grade gliomas (LGG) and high-grade gliomas (HGG). LGGs are benign tumours that grow slowly and have low potential to metastasize. HGGs, also known as glioblastoma multiforme [3], are malignant tumours with aggressive growth rates and high potential to metastasize.

Cancer Research UK estimates that 12,100 new brain, CNS and intracranial tumours were diagnosed between 2015 and 2017. Cancer Research UK [4] also states that brain, CNS and intracranial tumours were the 8th most common cancer in the UK in 2017, with a median age range at diagnosis of 40–44 years in females and 35–39 years in males. According to Cancer Research UK, survival depends on tumour type and age, but in general 40% of patients survive their cancer for 1 year or more, and more than 10% survive for 5 years or more [4]; survival further depends on factors such as age, tumour behaviour, the patient's response to treatment and the tumour markers present in the body. In addition, recent research [5] has shown that the incidence of glioblastoma multiforme increased six-fold between 2008 and 2017. Glioblastoma multiforme is also the most frequently occurring type of brain tumour in adults [3].

As such, early and accurate segmentation of brain tumours plays an important role in the overall survival chances and treatment options of patients diagnosed with brain tumours. Non-invasive imaging techniques such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scans [6] are utilised to produce detailed images of the brain, which are used to detect the presence of brain tumours in a patient. Manual segmentation of brain tumours from the 3D volumetric imagery produced by MRI or CT scans is a time-consuming and labour-intensive task [6, 7], as the operator has to segment slice by slice over a great number of slices to extract the boundaries of the target structure.

2 Background and Related Work

Before the rise of deep learning techniques in the medical imaging and computer vision domains, more traditional approaches were used to segment brain tumours from medical scans produced by MRI or CT procedures. Traditional approaches can be classified as non-learning approaches: they do not involve machine learning techniques that learn features and patterns from the image in order to segment the tumour. One such approach is thresholding, a straightforward technique that classifies pixels by comparing their intensity values against a defined threshold. Thresholding methods can be further split into two types [8]: global and local thresholding. Global thresholding is used when an image has only two classes of interest that can be separated distinctly using a single threshold; if the image has more than two classes of interest, local thresholding is the better choice. Global thresholding has been used in brain tumour segmentation [9] to segment enhancing tumour sections from T1-weighted images. By applying an intensity threshold to a manually selected region of interest in combination with a Sobel edge filter, the resulting edge-probability image is used to classify border pixels according to their edge probabilities. However, this technique has certain drawbacks [8], as it does not account for hyper-intense pixels that represent normal brain structures in T1-weighted images.
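As an illustrative sketch (not the exact method of [9]), the following Python snippet combines a global intensity threshold with a Sobel edge filter on a single 2D slice; the threshold values are hypothetical and would in practice be chosen by an operator for a selected region of interest.

```python
import numpy as np
from scipy import ndimage

def threshold_with_edges(slice_2d: np.ndarray, intensity_thresh: float) -> np.ndarray:
    """Global thresholding refined by Sobel edge probabilities (illustrative only)."""
    slice_2d = slice_2d.astype(float)

    # Binary mask from a single global intensity threshold.
    mask = slice_2d > intensity_thresh

    # Sobel gradients along both axes; the normalised magnitude
    # serves as an edge-probability map.
    gx = ndimage.sobel(slice_2d, axis=0)
    gy = ndimage.sobel(slice_2d, axis=1)
    edge_prob = np.hypot(gx, gy)
    edge_prob /= edge_prob.max() + 1e-8

    # Keep border pixels of the mask only where the edge probability
    # is high (the 0.5 cut-off is an assumed hyperparameter).
    border = mask ^ ndimage.binary_erosion(mask)
    mask[border & (edge_prob < 0.5)] = False
    return mask
```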

Another traditional approach to medical image segmentation is the region-based approach. Region growing is one such method, with the goal of extracting a region of the image based on some predefined homogeneity criteria [10]. In short, region growing requires a manually determined seed point; it then extracts neighbouring pixels that meet the homogeneity criteria and merges them into a region. The region "grows" until the homogeneity criteria are no longer fulfilled. Related region growing approaches in brain tumour segmentation include [11], where two homogeneity criteria, "intensity" and "orientation", were used in a modified region growing technique. Pixels are added only if both criteria are met: the "intensity" criterion requires the pixel-wise intensity value to exceed a certain threshold, while the "orientation" criterion, a novelty in the region growing approach, computes the difference in gradient between neighbouring pixels and includes a neighbouring pixel only if this difference is below a certain threshold.
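A minimal sketch of classical intensity-based region growing, assuming a 4-connected neighbourhood and a hypothetical homogeneity tolerance `tol`:

```python
import numpy as np
from collections import deque

def region_grow(image: np.ndarray, seed: tuple, tol: float) -> np.ndarray:
    """Grow a region from `seed`, accepting 4-connected neighbours whose
    intensity is within `tol` of the seed intensity (illustrative only)."""
    h, w = image.shape
    seed_val = float(image[seed])
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not region[ny, nx]:
                # Homogeneity criterion: intensity close to the seed value.
                if abs(float(image[ny, nx]) - seed_val) <= tol:
                    region[ny, nx] = True
                    queue.append((ny, nx))
    return region
```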

Despite the wide availability of traditional segmentation techniques, these semi-automatic techniques require manual intervention from human operators, and without proper domain expertise they can produce unfavourable results. Meanwhile, fully automatic techniques based on deep learning were quickly emerging in the medical imaging and computer vision domains, spurred by the success of the CNN architecture on the ImageNet dataset [12]. However, the costly and time-intensive process of preparing labelled medical data posed a challenge to the practicality of deep learning in the medical imaging domain. Several approaches to confront this problem were proposed, such as transfer learning, which speeds up convergence and increases the accuracy of CNNs by transferring knowledge gained [13] from a non-medical domain to a different but related domain. The study showed that knowledge gained from the non-medical domain could be transferred to the medical domain through fine-tuning and further training.
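As a rough illustration of this style of transfer learning (not the exact setup of [13]), the following PyTorch sketch loads an ImageNet-pretrained ResNet34, replaces its classification head for a hypothetical two-class medical task, and fine-tunes the whole network at a small learning rate.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained weights as the knowledge source.
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

# Replace the final fully connected layer for a hypothetical
# two-class medical classification task.
model.fc = nn.Linear(model.fc.in_features, 2)

# Fine-tune: all layers remain trainable, but a small learning rate
# preserves most of the pretrained representations.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```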

Following that, Fully Convolutional Networks (FCNs) for semantic segmentation [14] were introduced in 2014, proposing a novel architecture that replaces the fully connected (dense) layers in CNNs with convolutional layers, allowing variable-size input compared to non-convolutional nets that accept only fixed-size input, such as the architecture proposed by [15]. FCNs also introduce skip connections that fuse the output of lower layers with that of higher layers, retaining global structure in the predictions and reducing the loss of detail in the final predictions. The research adapted popular classification networks such as AlexNet [12], VGGNet [16] and GoogLeNet [17] into their fully convolutional counterparts, which achieved state-of-the-art accuracy on the PASCAL VOC 2011 dataset. The rise of FCNs led to the development of U-Net [18], one of the key contributors to the field of semantic segmentation. Despite its original purpose being the segmentation of neuronal structures, U-Net has been shown to be applicable to various other imaging domains and has been adapted for the segmentation of various objects, with examples such as the pancreas [19], coral reefs [20], and even audio signals from human voices [21].
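To illustrate the skip-connection idea at the heart of U-Net-style architectures, a minimal PyTorch decoder block is sketched below; the class name, channel arguments and layer choices are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Decoder block with a U-Net-style skip connection (illustrative)."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        # After concatenation the channel count is out_ch + skip_ch.
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)                   # upsample decoder features
        x = torch.cat([x, skip], dim=1)  # fuse low-level encoder features
        return self.conv(x)
```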

Despite the popularity of FCNs for brain tumour segmentation tasks in recent years, Wang et al. proposed a novel CNN architecture to tackle this task. The authors proposed a cascade of CNNs [22] used in conjunction with anisotropic convolutions, fusing the outputs of the cascade in three orthogonal views to obtain a more accurate and robust segmentation prediction. The authors approach the challenge hierarchically, using one CNN to create a bounding box around one class of tumour region, then feeding the output into the next CNN to create a bounding box for the next tumour region, decomposing the task into a sequence of binary segmentation problems. Anisotropic convolutions were introduced to reduce memory consumption by using a smaller receptive field, with the trade-off that the network loses some global feature information. In addition, residual connections [23] are used in the inter-slice layers by adding the input of the block to the output, encouraging the learning of residual functions of the input. Predictions were made by fusing the segmentation results from the axial, sagittal, and coronal views. The authors introduced an architecture that produces competitive accuracy while being more efficient at test time compared to the more common FCN approaches.
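A minimal sketch of how an anisotropic convolution with a residual connection might look in PyTorch; the 1×3×3 kernel and single-block structure are assumptions for illustration, not the exact configuration of [22].

```python
import torch
import torch.nn as nn

class AnisotropicResidualBlock(nn.Module):
    """3D residual block with an anisotropic kernel (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        # A 3x3 in-plane kernel with size 1 along the slice axis keeps
        # the through-plane receptive field small, reducing memory use.
        self.conv = nn.Conv3d(channels, channels,
                              kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: add the block input to its output.
        return self.relu(x + self.conv(x))
```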

An interesting submission to the BraTS 2018 challenge was the work by Andriy Myronenko, which utilized an asymmetrical encoder-decoder CNN architecture [24] in which the encoder is the larger part and the decoder the smaller part. The larger encoder is responsible for extracting feature maps from the image, while the smaller decoder is responsible for reconstructing the segmentation mask. The author introduced an additional branch at the endpoint of the encoder section that regularizes the architecture, and skip connections are used to transfer lower-level features to higher levels of abstraction [25]. Rather than performing image augmentation as a pre-processing step, the author performs image augmentation at test time. The final submission was an ensemble of 10 models, which took first place in the BraTS 2018 challenge.

Kamnitsas et al. constructed an architecture [26] with the goal of producing a more reliable and objective deep learning model that generalizes to various medical databases and is robust to failures of individual components. The architecture was termed the Ensemble of Multiple Models and Architectures (EMMA). The authors constructed an ensemble based on popular and well-performing architectures in the medical imaging space: two DeepMedic [27] models, three FCNs [14] and two 3D U-Net architectures [18]. Slight modifications were made to all three architectures to adapt them to the ensemble, such as doubling the number of feature maps at each layer of DeepMedic and changing the U-Net skip connections to sum signals instead of concatenating them. All models were trained individually; at prediction time, a confidence map for each class is created by computing, for each voxel, the confidence that it belongs to that class, and EMMA assigns each voxel to the class with the highest confidence. This approach won the authors first place at the BraTS 2017 segmentation challenge.
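A hedged sketch of this style of ensembling: each model's per-class confidence maps are averaged and every voxel is assigned its highest-confidence class. The function, tensor shapes and softmax fusion below are assumptions, not the exact EMMA procedure.

```python
import torch

def ensemble_predict(models, volume: torch.Tensor) -> torch.Tensor:
    """Fuse per-class confidence maps from several models (illustrative).

    volume: input tensor of shape (1, C, D, H, W).
    Returns a label volume of shape (D, H, W).
    """
    with torch.no_grad():
        # Each model outputs logits of shape (1, num_classes, D, H, W).
        probs = [torch.softmax(m(volume), dim=1) for m in models]
        # Average the confidence maps across the ensemble.
        mean_conf = torch.stack(probs).mean(dim=0)
        # Assign each voxel to the class with the highest confidence.
        return mean_conf.argmax(dim=1).squeeze(0)
```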

Jonas et al. proposed a transfer learning approach [28] that utilizes a ResNet34 encoder, extending AlbuNet [29], proposed by A. Shvets et al. The authors dropped the T1 modality from the BraTS2020 dataset to match the 3-channel input of ResNet34. For the evaluation of the model, Jonas et al. used the validation set of the BraTS2020 challenge in addition to a private dataset obtained from a Syrian-Lebanese hospital situated in Brazil. The research shows that their model outperforms AlbuNet2D and that pretraining yields a more robust training process with speedier convergence compared to models without pretraining.

Yixin et al. also contributed to the BraTS 2020 segmentation task, proposing a novel "Modality-Pairing Network" architecture [30]. The authors split the modalities into two groups, (T1, T1ce) and (T2, FLAIR), for a dual-branch network based on the 3D U-Net. The first branch uses the FLAIR and T2 modalities to extract features of the whole tumour, while the second branch uses the T1 and T1ce modalities to learn other feature representations of the tumour. The branches are densely connected to learn complementary information effectively. Another unique point of the paper is the use of an ensemble of models to provide segmentation labels by priority: the sigmoid predictions of each trained model are averaged, and the label with the highest priority is selected. The authors won second place in the segmentation task at the BraTS 2020 challenge with this approach.
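One possible reading of this label-priority fusion, sketched under the assumptions that the three sigmoid channels correspond to the nested BraTS regions (whole tumour ⊇ tumour core ⊇ enhancing tumour) and that a 0.5 threshold is used:

```python
import torch

def fuse_by_priority(sigmoid_maps: list) -> torch.Tensor:
    """Average sigmoid outputs over an ensemble, then assign each voxel
    the highest-priority region it belongs to (illustrative sketch).

    Each tensor has shape (3, D, H, W) with channels ordered
    (whole tumour, tumour core, enhancing tumour).
    """
    mean = torch.stack(sigmoid_maps).mean(dim=0)
    wt, tc, et = (mean[i] > 0.5 for i in range(3))
    labels = torch.zeros_like(wt, dtype=torch.long)
    labels[wt] = 2   # edema by default inside the whole tumour
    labels[tc] = 1   # overwritten by the non-enhancing core
    labels[et] = 4   # highest priority: enhancing tumour
    return labels
```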

In the paper published by Fabian et al. [31], the authors employed nnU-Net [32], an automated framework they had previously developed for 3D biomedical image segmentation, in the BraTS 2020 segmentation challenge, with BraTS-specific optimizations to score better on the challenge. These optimizations include a region-based training approach, splitting the entire tumour into three subregions based on the BraTS labelling structure ("edema", "non-enhancing tumour and necrosis" and "enhancing tumour") and changing the objective function to optimize over all three tumour subregions jointly instead of optimizing each label individually. The authors also increased the probability of the augmentations applied to their data samples, artificially increasing the number of data points by applying changes to the original samples and thus increasing the generalizability of the model. Lastly, the authors developed an internal BraTS-like ranking system to more realistically gauge their models against the BraTS segmentation benchmarks, using the BraTS evaluation metrics to decide on the ensemble of models to submit to the competition. With these efforts, the team achieved first place in the BraTS 2020 segmentation challenge, demonstrating that nnU-Net generalizes across various medical imaging domains and provides state-of-the-art segmentation accuracy.
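A sketch of the region-based idea, assuming the usual BraTS label convention (1 = necrosis/non-enhancing tumour, 2 = edema, 4 = enhancing tumour): the mutually exclusive labels are converted into three overlapping regions that can each be optimized directly.

```python
import numpy as np

def labels_to_regions(seg: np.ndarray) -> np.ndarray:
    """Convert BraTS labels into overlapping regions (illustrative).

    seg: integer label volume with values {0, 1, 2, 4}.
    Returns a (3, ...) boolean array: whole tumour, tumour core,
    enhancing tumour.
    """
    whole_tumour = np.isin(seg, (1, 2, 4))
    tumour_core = np.isin(seg, (1, 4))
    enhancing = seg == 4
    return np.stack([whole_tumour, tumour_core, enhancing])
```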

In our work, we extend the work of Jonas et al., aiming to fill gaps in the research by extending AlbuNet3D to accept all four input modalities. We believe that discarding the T1 modality loses valuable knowledge and feature representations. We also experiment with a different noise injection, motivated by research characterizing the distribution of signal intensities in MRI images in the presence of noise. We utilise a combination of these techniques and report our results.

3 Methodology

3.1 BraTS2020 Dataset

The publicly released BraTS2020 dataset consists of multimodal MRI scans of glioblastomas (HGG) and lower-grade gliomas (LGG), containing 369 training entries with ground truths and 125 validation entries without ground truths. The ground truth consists of annotations of three different tumour regions, namely the enhancing tumour (ET) with label 4, the peritumoral edema (ED) with label 2 and the non-enhancing tumour core (NCR/NET) with label 1.

3.2 Extending the Input Channels

The original AlbuNet3D utilised only a 3-channel input due to the original nature of its ResNet34 encoder; the T1ce, T2 and FLAIR modalities were used in the original paper by Jonas et al. In our project, we explore extending the original encoder to a 4-channel input. We replace the initial 2D convolutional layer in the ResNet34 encoder with another 2D convolutional layer that accepts 4 input channels, and we initialize the weights of the new convolutional layer with the pretrained weights of the first convolutional layer. The reasoning behind this is to retain the knowledge representation in the pretrained weights: a random initialization would carry no pretrained knowledge, reducing the effectiveness of the original transfer learning approach.
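A sketch of this channel extension in PyTorch. How the fourth channel itself is initialized is an implementation detail; here we assume the mean of the three pretrained channel kernels, though copying a single channel would also fit the description above.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pretrained ResNet34 encoder.
encoder = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

# Original first layer: Conv2d(3, 64, kernel_size=7, stride=2, padding=3).
old_conv = encoder.conv1
new_conv = nn.Conv2d(4, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=old_conv.bias is not None)

with torch.no_grad():
    # Copy the pretrained weights for the first three input channels.
    new_conv.weight[:, :3] = old_conv.weight
    # Assumed initialization of the fourth channel: the mean of the
    # pretrained channel kernels.
    new_conv.weight[:, 3] = old_conv.weight.mean(dim=1)

encoder.conv1 = new_conv
```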

3.3 Pre-processing and Data Augmentation Policies

Fig. 1. Comparison between the original image and the transformations at slice 20 using a red heatmap plot, with all transformations applied on the last image.

All modalities undergo Z-score normalization and are cropped to their non-zero regions to reduce subsequent computation time and memory usage. Various data augmentations are applied to all modalities: spatial and geometric transforms by rotation, random cropping, elastic deformation, and mirroring along the axes, each at 10% probability. Colour space transforms are then applied with 15% probability by increasing pixel brightness multiplicatively, followed by a gamma transformation that introduces gamma correction as an augmentation. Lastly, we introduce Rician noise injection as our data augmentation technique instead of the regular Gaussian noise injection, based on research [33] showing that the intensity of MRI signals in the presence of noise follows a Rician distribution. We also experiment with doubling the probabilities of all data augmentations and transforms.
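A minimal sketch of Rician noise injection (the noise scale `sigma` is a hypothetical hyperparameter): Rician-distributed magnitudes arise when independent Gaussian noise corrupts the real and imaginary parts of the complex MRI signal before the magnitude is taken.

```python
import numpy as np

def add_rician_noise(image: np.ndarray, sigma: float,
                     rng: np.random.Generator = None) -> np.ndarray:
    """Inject Rician noise into an MRI volume (illustrative sketch).

    The clean image is treated as the real part of the complex signal;
    adding Gaussian noise to both real and imaginary parts and taking
    the magnitude yields Rician-distributed intensities.
    """
    rng = rng or np.random.default_rng()
    real = image + rng.normal(0.0, sigma, image.shape)
    imag = rng.normal(0.0, sigma, image.shape)
    return np.sqrt(real ** 2 + imag ** 2)
```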

3.4 Training and Hyperparameters

Our optimizer of choice is Adam with a learning rate of 1e−3. We use a minibatch size of 100 and a batch size of 12. Training was performed on all 369 training entries for 50 epochs. Our loss function is the Multiple Dice Loss [32], given by Eq. 1:

$$\begin{aligned} L(X,Y) = -\frac{2}{K}\sum _{k\in K}\frac{\sum _i |X_k \cap Y_k|_i}{\sum _i |X_k|_i + \sum _i |Y_k|_i}, i\in I, k\in K \end{aligned}$$
(1)

where K represents the number of classes, X and Y represent the predictions of the model and the ground truth segmentations respectively, and i indexes the voxels I.
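A sketch of this loss in PyTorch, assuming the network outputs one probability map per class, one-hot encoded targets, and a small epsilon for numerical stability:

```python
import torch

def multiple_dice_loss(pred: torch.Tensor, target: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss averaged over classes (illustrative sketch).

    pred:   probabilities of shape (B, K, D, H, W).
    target: one-hot ground truth of the same shape.
    """
    dims = (0, 2, 3, 4)  # sum over batch and spatial voxels
    intersection = (pred * target).sum(dims)
    denominator = pred.sum(dims) + target.sum(dims)
    dice_per_class = 2.0 * intersection / (denominator + eps)
    # Negative mean over the K classes, matching Eq. 1.
    return -dice_per_class.mean()
```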

4 Results and Discussion

Table 1. Dice score and Hausdorff distance of our experiments over 5 runs on the test data
Table 2. Standard deviation of Dice scores of our experiments over 5 runs on the test data

As seen in Table 1, extending the input to 4 modalities (4C) with weight initialization (WI) and Rician noise (RN) as the preferred noise injection yielded a marginal increase in accuracy on the Enhancing Tumour (ET) and Tumour Core (TC) classes. From Table 2, we also notice that the standard deviation of both classes shrinks marginally, signifying that the model is more robust towards outliers. However, without weight initialization, the 4-modality approach performs slightly worse than 4C + WI + RN. We hypothesize that initializing with existing pretrained weights helped stabilize the training process and produce a more robust model owing to the existing knowledge and features in the pretrained weights. Without them, the new convolutional channel for the fourth modality is initialized with random weights that provide no learnt representations or knowledge to the model.

Aggressive data augmentation policies (denoted DA) often resulted in degraded segmentation performance. By increasing the probability of all augmentations, more images in the training set are transformed and augmented. However, these aggressive augmentations did not provide any boost in accuracy, possibly because the model lost its ability to generalize as the augmented samples no longer reflected the deformities found in actual MRI images. A different combination of augmentations could possibly be used to achieve a more robust model.

5 Conclusion

We show that our results marginally outperform the original AlbuNet3D by extending the input to four channels to accept all modalities of the BraTS dataset. To further improve the accuracy of this extension, future work should focus on the initialization of weights when extending the pretrained ResNet34 encoder of AlbuNet3D, or extend the approach to other pretrained encoders that may provide a greater accuracy boost. It is important that this line of research continues, towards adaptive methods that can process inputs with a variable number of modalities.

In conclusion, we have achieved the original aim and objectives of the project: to address gaps in the literature by proposing an improved algorithm that provides enhanced segmentation accuracy. Further work should focus on the adaptive initialization of weights at the start of training when extending the input channels of a pretrained network, to further stabilise the training process and achieve higher segmentation accuracy.