
1 Introduction

In recent years, convolutional neural networks (CNNs) have become the technique of choice for state-of-the-art implementations in various image classification tasks [1,2,3,4,5]. In this study, a novel fully convolutional residual neural network (FCR-NN) architecture is proposed for medical image segmentation, synthesizing several recently described techniques: (1) deep residual learning; (2) a multi-resolution recursive topology; and (3) a fully convolutional neural network architecture.

1.1 Deep Residual Learning

Deep residual learning [4] posits that any traditional neural network layer may be reformulated as a residual function: an identity (or linear projection) mapping of the layer input plus a learned residual:

$$\begin{aligned} x_{L} = Wx_{l} + f(x_{l}) \end{aligned}$$
(1)

Here the output layer, \(x_{L}\), is the sum of \(x_{l}\), the input layer, and \(f(x_{l})\), a residual change from the original input represented by an arbitrary sequence of non-linear functions. The projection matrix, W, is an optional term to match input and output layer dimensions. By preserving a linear relationship between the input and output layers, residual neural networks are able to stabilize gradients during backpropagation, leading to improved optimization and facilitating greater network depth. Networks based on this residual architecture have achieved top results in the ImageNet ILSVRC 2015 international challenge.
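As a minimal sketch of Eq. (1) (illustrative PyTorch, not the Matlab/MatConvNet implementation used in this study), a residual layer may be written as follows; the 1\(\,\times \,\)1 convolution stands in for the optional projection matrix W, and the particular layers inside the residual branch are assumptions for illustration:

```python
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Eq. (1): x_L = W*x_l + f(x_l), where f is an arbitrary
    sequence of nonlinear functions and W is an optional linear
    projection used only to match input/output dimensions."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.f = nn.Sequential(              # residual branch f(x_l)
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # projection W; identity when dimensions already match
        self.W = (nn.Identity() if in_ch == out_ch
                  else nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        return self.W(x) + self.f(x)         # x_L = W*x_l + f(x_l)
```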

Image Segmentation Adaptation. In this study we hypothesize that the image segmentation problem can be similarly reformulated as a residual function, such that the final classification labels can be derived from a simple linear combination of the original input image, \(x_{l}\), and some arbitrary residual, \(f(x_{l})\). In addition to the advantages for optimization and network depth described above, this architecture benefits from using the original input image directly in the final classification task. This latter proposition may be counterintuitive at first; indeed, a widely held assumption is that raw image inputs contain too much inherent noise and variation to be useful directly for classification. However, a unique aspect of medical image segmentation is that, after accounting for high-level contextual anatomic cues, the final segmentation is heavily influenced by low-level features of the original image, namely the signal intensity of any given voxel. For this reason we hypothesize that a deep residual architecture may be optimally suited to the medical image segmentation task.

1.2 Residual Functions for Multi-resolution Features

It is well-established that for the task of object localization a synthesis of contextual cues at multiple spatial resolutions is required for optimal performance [3, 6, 7]. For neural networks this has been implemented by combining feature maps from various layers, with the convolutional transpose operation used to upsample high-order deep layer activations to match those of more superficial layers [8]. Ronneberger et al. [9] further elaborated on this technique by proposing a symmetric contracting and expanding topology that efficiently combines low- and high-level features.

This study reformulates the latter strategy in the form of residual functions. Specifically, from Eq. (1) above, an explicit definition is made such that the residual function, \(f_{l}\), represents some arbitrary nonlinear transformation performed on the original image, \(x_{l}\), that captures its higher-level features and which is subsequently combined with its original lower-level properties. A more precise formulation of the higher-level feature map, \(x_{l+1}\), is given by:

$$\begin{aligned} x_{l+1} = \sigma _{l}(x_{l}) \end{aligned}$$
(2)

Here \(\sigma _{l}(x_{l})\) represents a sequence of convolutions and nonlinear activation functions which act to transform the original image, \(x_{l}\), into its higher-level feature map, \(x_{l+1}\). As a result, the original function \(f_{l}\) in Eq. (1), defined in the previous paragraph as the residual arising from the influence of higher-level features, can be written as:

$$\begin{aligned} f_{l}(x_{l}) = x_{l+1} + f_{l+1}(x_{l+1}) = \sigma _{l}(x_{l}) + f_{l+1}(\sigma _{l}(x_{l})) \end{aligned}$$
(3)

Here \(f_{l+1}\) is a new residual function which, like its predecessor, represents an incremental change arising from the influence of even higher-level features derived from the first feature map, \(x_{l+1}\). Expanding Eq. (1) recursively using Eq. (3) allows us to formulate the neural network for image segmentation as a simple series of residual functions, each term of which represents the additional incremental change contributed by progressively higher-level spatial features:

$$\begin{aligned} x_{L} = Wx_{l} + f_{l}(x_{l}) = Wx_{l} + \sigma _{l}(x_{l}) + \sigma _{l+1}(\sigma _{l}(x_{l})) + ... \end{aligned}$$
(4)

The formulations as described by Eqs. (3) and (4) are visually demonstrated in Fig. 1.
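The recursion can be made concrete with a short schematic sketch (plain Python; the resolution matching performed by the network's upsampling pathway is abstracted into the assumption that all terms share the output shape):

```python
def segment(x, W, stages):
    """Schematic of Eq. (4):
    x_L = W(x_l) + sigma_l(x_l) + sigma_{l+1}(sigma_l(x_l)) + ...

    `W` is a callable linear projection; `stages` is a list of
    callables [sigma_l, sigma_{l+1}, ...], each mapping a feature
    map to the next, higher-level one. All terms are assumed to
    share the output shape; in the actual network, deconvolutions
    restore each term to the input resolution before summation.
    """
    out = W(x)              # linear projection of the original input
    feat = x
    for sigma in stages:    # accumulate progressively higher-level residuals
        feat = sigma(feat)
        out = out + feat
    return out
```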

Fig. 1. Recursive series of residual functions for image segmentation.

1.3 Fully Convolutional Neural Networks

Originally introduced by Long et al. [8], fully convolutional neural networks are implemented as a series of upsampling convolutional transpose operators performed on the deepest network layers, resulting in a final dense classification matrix equal in dimension to the original image for each forward pass. In contradistinction to the typical sliding-window CNN approach to image segmentation [10,11,12], a fully convolutional architecture is highly efficient. In the proposed FCR-NN, where an entire axial MR slice is used as the input image, the number of forward passes required to classify each patient equals the number of slices in the MR volume, a task that can be completed in a few seconds on a GPU-optimized workstation.

Several additional, less apparent benefits of fully convolutional neural networks include: (1) increased inherent regularization; and (2) per-voxel classification influenced by a larger field-of-view. The first effect, increased regularization, arises from the observation that the loss function for each forward/backward pass is derived from contributions at every voxel within the input image. In this specific example, with an input image size of 240\(\,\times \,\)240 voxels, the loss function is driven by \(240^{2}\) derivatives per slice. Furthermore, the ratio of different tissue classes is reflected accurately by the aggregate of axial slices in each mini-batch, and thus there is no need for multiphase training to account for class imbalance, as is typically the case for non-fully convolutional designs [10].
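To make the dense-loss point concrete, a hedged PyTorch sketch (shapes and class count are illustrative) shows how every voxel of a 240\(\,\times \,\)240 slice contributes a term to the loss:

```python
import torch
import torch.nn.functional as F

# illustrative shapes: a mini-batch of 8 axial slices, 5 tissue
# classes, 240 x 240 voxels per slice
logits = torch.randn(8, 5, 240, 240)          # dense network output
labels = torch.randint(0, 5, (8, 240, 240))   # per-voxel ground truth

# cross-entropy averaged over every voxel: each slice contributes
# 240^2 terms (and gradients) to the loss
loss = F.cross_entropy(logits, labels)
```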

Second, compared to a sliding-window CNN design in which classification is limited to a small input patch, a fully convolutional network allows voxel-wise prediction to be influenced by a field-of-view that grows with network depth. This effect is further complemented by the fact that residual networks allow for relatively deep architectures. In the proposed FCR-NN, convolution of the deepest 15\(\,\times \,\)15 feature map with a series of three 3\(\,\times \,\)3 kernels yields an effective 7\(\,\times \,\)7 receptive field covering nearly one quarter of that feature map's area.
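The arithmetic behind this receptive field can be checked with a few lines (an illustrative helper, not part of the original implementation); three stacked 3\(\,\times \,\)3 kernels yield a 7\(\,\times \,\)7 field, whose 49 voxels cover roughly 22% of the 225-voxel (15\(\,\times \,\)15) map:

```python
def receptive_field(kernel_sizes, strides):
    """Effective receptive field of a stack of convolutional layers."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field
        jump *= s              # stride compounds the spacing
    return rf

# three stride-1 3x3 convolutions: 1 + 2 + 2 + 2 = 7
print(receptive_field([3, 3, 3], [1, 1, 1]))   # -> 7
```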

2 Related Work

Despite significant research over the past several decades, brain tumor segmentation remains a challenging task. Common strategies cited in the literature include: edge-based methods including active contours [13]; region-based methods [14]; classification or clustering methods with constraints based on atlases [15, 16], deformable models [17] or neighborhood regularization [18]; and hybrid generative-discriminative frameworks [19,20,21,22,23,24]. However, these traditional methods are limited by a priori assumptions about the rules or features that best model the segmentation mask.

Convolutional neural networks are an emerging technique in computer vision, increasingly recognized as the state-of-the-art approach for various image recognition tasks [25]. The power of neural networks arises from their capacity for automatic feature extraction, independently learning the high-order representation of the data that best approximates a given problem. In recent years, CNN-based algorithms have been adopted for various medical image segmentation tasks [1,2,3,4,5, 9]. In the BRaTS 2015 challenge, the 2nd- [26] and 4th-ranked [10] submissions were based on CNN architectures.

3 Network Architecture

A fully convolutional residual neural network (FCR-NN) is proposed for medical image segmentation. The network is composed of a series of recursive residual identity mapping blocks, the prototype of which is shown in Fig. 2.

Fig. 2. Residual identity mapping blocks for image segmentation.

The network comprises a total of 22 layers (18 regular convolutional kernels and 4 upsampling convolutional transpose kernels) and 661,700 parameters. The entire architecture of a single FCR-NN is diagrammed in Fig. 3.

Fig. 3. Fully convolutional residual neural network (FCR-NN) architecture for image segmentation. Abbreviations: Conv = 3\(\,\times \,\)3 convolution; Conv(s2) = 3\(\,\times \,\)3 convolution with stride 2 (downsample); Deconv = 2\(\,\times \,\)2 deconvolution (upsample); BN = batch normalization; ReLU = rectified linear unit.

3.1 Fully Convolutional Residual Neural Networks

Convolutional Neural Networks. CNNs are an adaptation of the traditional artificial neural network architecture whereby stacks of 4D convolutional kernels and nonlinear activation functions transform a multidimensional input image into progressively higher-order feature representations [27]. The proposed CNN is implemented entirely with 3\(\,\times \,\)3 convolutional kernels to prevent overfitting, as described in [2]. No pooling layers are used; instead, downsampling is implemented simply by means of a 3\(\,\times \,\)3 convolutional kernel with a stride of 2, halving each spatial dimension and thus reducing feature map size by 75%. All nonlinear functions are modeled by the rectified linear unit (ReLU) [28]. Batch normalization is used between the convolutional and ReLU layers to limit drift of layer activations during training [29]. In successively deeper layers the number of feature channels increases through 16, 32, 64, 128, and 256, reflecting increasing representational complexity.
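A sketch of one downsampling stage under these design rules (illustrative PyTorch, not the original MatConvNet code; the exact number and ordering of layers per stage are assumptions):

```python
import torch.nn as nn

def down_stage(in_ch, out_ch):
    """One encoder stage: a stride-2 3x3 convolution (no pooling)
    halves each spatial dimension (a 75% reduction in feature map
    size); batch norm sits between each convolution and its ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# channel progression 16 -> 32 -> 64 -> 128 -> 256 in deeper layers
encoder = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=3, padding=1),   # 4-channel MR input
    *[down_stage(i, o) for i, o in [(16, 32), (32, 64),
                                    (64, 128), (128, 256)]],
)
```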

Deconvolutions. To upsample each high-level feature map, a deconvolution (convolutional transpose) operator is implemented by means of a 2\(\,\times \,\)2 kernel with a stride of 2, doubling each spatial dimension and thereby reversing the 75% size reduction from downsampling. At the same time, the number of feature channels is reduced by 50% to match the corresponding activation layer in the mirror-image pathway (Figs. 2 and 3).

Residual Connections. As shown in Fig. 2, residual identity mappings are implemented by means of a channel-wise addition between an input and its corresponding deconvolved feature map of the same activation layer size. The addition is performed after batch normalization of the deconvolved output but before the nonlinear ReLU activation, as suggested by [4].
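Combining the two preceding operations, a hedged sketch of one upsampling step with its residual identity mapping (illustrative PyTorch; the skip connection is the matching feature map from the mirror pathway):

```python
import torch.nn as nn

class UpBlock(nn.Module):
    """Deconvolution plus residual identity mapping (sketch)."""

    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch // 2                  # halve channels to match mirror path
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch,
                                         kernel_size=2, stride=2)  # doubles H and W
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, deep, skip):
        # channel-wise addition after BN, before ReLU, as suggested by [4]
        return self.relu(self.bn(self.deconv(deep)) + skip)
```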

3.2 Serial Architecture

The task of brain tumor segmentation as defined in the BRaTS challenge can be divided into two sequential goals. The first task is to differentiate between normal and abnormal brain tissue (whole tumor segmentation). Subsequently, using this result as an input, the second task is to differentiate between the various brain tumor tissue types. This process is diagrammed in Fig. 4.

Fig. 4. Implementation of serial fully convolutional residual neural networks (FCR-NN) for the two-part segmentation task.

As shown in the figure, these tasks are learned independently by two separate FCR-NNs. At test time, the segmentation mask generated by the first FCR-NN is dilated by 10 voxels; all voxels outside this mask are set to 0, and the resulting masked image is used as the input to the second FCR-NN. By separating these tasks explicitly, the two networks can be independently optimized for the highest performance on their respective goals.
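A sketch of this inter-network masking step (SciPy used for illustration; the study's Matlab routines are not specified):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def mask_for_second_network(image, whole_tumor_mask, dilation=10):
    """Dilate the first network's binary mask by `dilation` voxels,
    then zero all voxels outside it before the second FCR-NN.
    `image` and `whole_tumor_mask` are assumed to share a shape."""
    dilated = binary_dilation(whole_tumor_mask, iterations=dilation)
    return np.where(dilated, image, 0.0)
```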

3.3 Training Details

Image Preprocessing. Each MR volume was normalized simply to a mean signal intensity of 0.5 and standard deviation of 1/6 such that the range [0, 1] contains three standard deviations in each direction from the mean. No other image preprocessing was necessary.
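This normalization amounts to a z-score shifted and rescaled so that \(\pm \)3 standard deviations map onto [0, 1], e.g. (a minimal NumPy sketch):

```python
import numpy as np

def normalize_volume(vol):
    """Shift and scale so the mean maps to 0.5 and one standard
    deviation to 1/6; [0, 1] then spans +/-3 standard deviations."""
    return (vol - vol.mean()) / (6.0 * vol.std()) + 0.5
```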

Data Augmentation. Given the relatively limited training dataset available for the BRaTS 2016 challenge, and more generally in the realm of medical imaging where annotated datasets are scarce, judicious use of data augmentation is critical for successful CNN implementation. In this study two primary forms of data augmentation are employed. First, a separate cohort of IRB-approved patients at our institution, with annotations by a board-certified radiologist, was included to increase the breadth of data available for training. Importantly, these cases included a large number of patients with suboptimal brain extraction and/or significant imaging artifact. In addition, this institutional dataset was predominantly acquired at 3-Tesla MR imaging which, although it yields higher sensitivity to pathology, also results in increased image noise and decreased contrast resolution, particularly on T2-FLAIR sequences. It is important to point out that, upon visual inspection, the test set for the 2016 BRaTS challenge comprised a disproportionate share of 3-Tesla MR imaging relative to the training data available online.

The second type of data augmentation employed in this study involves a number of real-time modifications to the source images at training time. Specifically, 50% of all images in a mini-batch were modified randomly by means of: (1) addition across all voxels of a scalar between [−0.1, 0.1]; (2) rotation of the image by an angle between −45\(^{\circ }\) and 45\(^{\circ }\); (3) addition of a bias field gradient of strength between [0, 0.4] at a random angle between 0\(^{\circ }\) and 360\(^{\circ }\).
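An illustrative NumPy/SciPy version of these modifications (the exact Matlab routines are unspecified; the symmetric rotation range is assumed, and the bias field is modeled as a linear intensity ramp):

```python
import numpy as np
from scipy.ndimage import rotate

def augment(img, rng=np.random):
    """Randomly modify a 2D slice per the three listed operations."""
    if rng.rand() < 0.5:                       # 50% of images per mini-batch
        img = img + rng.uniform(-0.1, 0.1)     # (1) global intensity shift
        angle = rng.uniform(-45.0, 45.0)       # (2) rotation (range assumed)
        img = rotate(img, angle, reshape=False, mode='nearest')
        # (3) linear bias field of strength [0, 0.4] at a random angle
        h, w = img.shape
        theta = rng.uniform(0.0, 2 * np.pi)
        yy, xx = np.meshgrid(np.linspace(-0.5, 0.5, h),
                             np.linspace(-0.5, 0.5, w), indexing='ij')
        strength = rng.uniform(0.0, 0.4)
        img = img + strength * (np.cos(theta) * xx + np.sin(theta) * yy)
    return img
```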

Training Parameters. Training is implemented using standard stochastic gradient descent with Nesterov momentum [30]. Parameters are initialized using the heuristic described by He et al. [31]. L2 regularization is applied to prevent overfitting by penalizing the squared magnitude of the kernel weights. To account for training dynamics, the learning rate is annealed and the mini-batch size is increased whenever the training loss plateaus. Furthermore, a normalized gradient algorithm is employed to allow for locally adaptive learning rates that adjust according to changes in the input signal [32].
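An equivalent setup in PyTorch, for illustration only (the study used Matlab/MatConvNet; the normalized-gradient scheme of [32] and the mini-batch growth are not reproduced here, and all hyperparameter values are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 5, 3, padding=1)           # placeholder for the FCR-NN
nn.init.kaiming_normal_(model.weight)           # He et al. initialization [31]

optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-2,            # illustrative value
                            momentum=0.9,       # Nesterov momentum [30]
                            nesterov=True,
                            weight_decay=1e-4)  # L2 penalty on kernel weights

# anneal the learning rate whenever the training loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       mode='min',
                                                       factor=0.1,
                                                       patience=5)
```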

Implementation. Software code for this study was written in Matlab (R2016a), and benefited greatly from the MatConvNet toolbox [33]. Experiments were performed on a GPU-optimized workstation with a single NVIDIA GeForce GTX Titan X (12 GB). The combined software and hardware configurations allowed for test time (forward pass only) classification of approximately 155.1 images per second. An entire brain volume with 151 axial slices could be classified in approximately 1.94 s (two cumulative forward passes for serial FCR-NNs).

4 Experiments and Results

All 274 glioma cases (220 high-grade gliomas; 54 low-grade gliomas) in the BRaTS 2016 training set were included in this study. An additional 150 institutional cases of high-grade glioma, all acquired on 3-Tesla MR scanners, were included to augment the available training data. A total of 186 cases in the BRaTS 2016 database (132/220 high-grade gliomas; 54/54 low-grade gliomas) and 84 institutional cases represented treatment-naive (preoperative) tumors. The remaining cases were obtained at various postoperative time points.

Eighty percent of the cases (n = 339) were randomly assigned to the training data set, with the remaining 20% of the cases (n = 85) used as an independent validation set. All four channels (FLAIR, T1-precontrast, T1-postcontrast, T2) across an entire axial cross-sectional image slice (240\(\,\times \,\)240\(\,\times \,\)4 voxels) were used as inputs into the FCR-NN.

Dice scores and Hausdorff distances (mean and range) for the validation set are reported in Table 1. The results are competitive with historical trends, and the method finished in an overall tie for the top two in the BRaTS 2016 challenge.

Table 1. Validation set Dice scores and Hausdorff distances.

Direct comparison with the top four performing algorithms from the BRaTS 2015 challenge is limited by differences in the validation data cited in published results (the 2013 BRaTS training set in [10, 34]; a partial 2015 BRaTS training set in [23]; the full 2015 BRaTS training set in [26], which is identical to the 2016 BRaTS training set used in this study); nonetheless, several general patterns can be seen. With regard to complete tumor segmentation, the FCR-NN Dice of 0.89 is only marginally better than the previous top four algorithms (0.86–0.88). However, given that expert human inter-rater agreement for complete tumor segmentation is reported to be between 0.85 and 0.91 [35], this metric is likely approaching its theoretical upper limit. By contrast, the much more visually challenging task of separating tumor subcomponents remains to be fully solved. Here the FCR-NN approach yields a more substantial improvement in Dice score for both core tumor (0.83 vs. 0.73–0.79 for the previous top-performing algorithms) and enhancing tumor (0.78 vs. 0.59–0.73). This is likely in part attributable to the serial architecture, which dedicates an entire separate CNN to the identification of tumor subcomponents (isolating this task from the first CNN, which is responsible only for identifying tumor margins).

5 Conclusion

The proposed serial FCR-NN architecture implemented here in the setting of brain tumors is a robust method for medical image segmentation incorporating contextual cues from multiple spatial resolutions. The residual linear transformation identity mappings facilitate optimization and allow for direct contributions from the original input images for final classification. The fully convolutional architecture results in an efficient classifier which can segment an entire tumor volume in less than 3 s with state-of-the-art accuracy.