1 Introduction

Multi-organ segmentation has attracted considerable interest over the years. The recent success of deep learning-based classification and segmentation methods has triggered widespread application of deep learning-based semantic segmentation in medical imaging [1, 2]. Many methods have focused on the segmentation of single organs such as the prostate [1], liver [3], or pancreas [4, 5]. Deep learning-based multi-organ segmentation in abdominal CT has also been approached recently in works such as [6, 7]. Most of these methods are based on variants of fully convolutional networks (FCNs) [8] that either employ 2D convolutions on orthogonal cross-sections in a slice-by-slice fashion [3,4,5, 9] or 3D convolutions [1, 2, 7]. A common feature of these segmentation methods is that they extract the features useful for image segmentation directly from the training data, which is crucial to the success of deep learning and avoids the need for handcrafting features suited to the detection of individual organs.

However, most network architectures require severe downsampling or cropping of the images for 3D processing in order to meet the memory limitations of today's GPU cards [1, 7], while still needing to capture enough image context for accurate segmentation of organs.

In this work, we propose a multi-scale 3D FCN approach that utilizes a scale-space pyramid with auto-context to perform semantic image segmentation at a higher resolution while also considering large contextual information from lower resolution levels. We train our models on a large dataset of manually annotated abdominal organs and vessels from pre-operative clinical computed tomography (CT) images used in gastric surgery and evaluate them on a completely unseen dataset from a different hospital, achieving a promising performance compared to the state-of-the-art.

Our approach is shown schematically in Fig. 1. We draw on classical scale-space pyramid [10] and auto-context [11] ideas to integrate multi-scale and varying-context information into our deep learning-based image segmentation method. Instead of having separate FCN pathways for each scale, as explored in other work [12, 13], we utilize the auto-context principle to fuse and integrate information from different image scales and different amounts of context within a single 3D FCN pipeline. Our model can be trained end-to-end using modern deep learning frameworks. This is in contrast to previous work that utilized auto-context with separately trained models for brain segmentation [13].

In summary, our contributions are (1) introduction of a multi-scale pyramid of 3D FCNs; (2) improved segmentation of fine structures at higher resolution; (3) end-to-end training of multi-scale pyramid FCNs showing improved performance and good learning properties. We perform a comprehensive evaluation on a large training and validation dataset, plus unseen testing on data from different hospitals and public sources, showing promising generalizability.

Fig. 1. Multi-scale pyramid of 3D fully convolutional networks (FCNs) for multi-organ segmentation. The lower-resolution-level 3D FCN predictions are upsampled, cropped, and concatenated with the inputs of a higher-resolution 3D FCN. The Dice loss is used for optimization at each level and training is performed end-to-end.

2 Methods

2.1 3D Fully Convolutional Networks

Convolutional neural networks (CNNs) have the ability to solve challenging classification tasks in a data-driven manner. Fully convolutional networks (FCNs) are an extension of CNNs that has made it feasible to train models for pixel-wise semantic segmentation in an end-to-end fashion [8]. In FCNs, feature learning is driven purely by the training data, the segmentation task at hand, and the network architecture. Given a training set of images and labels \(\mathbf {S} = \left\{ (X_n,L_n),\ n = 1,\dots ,N\right\} \), where \(X_n\) denotes a CT image and \(L_n\) a ground truth label image, the model can be trained to minimize a loss function \(\mathcal {L}\) in order to optimize the FCN model \(f(X,\varTheta )\), where \(\varTheta \) denotes the network parameters, including the convolutional kernel weights for hierarchical feature extraction.

While efficient implementations of 3D convolutions and growing GPU memory have made it possible to deploy FCNs on 3D biomedical imaging data [1, 2], image volumes are in practice often cropped and downsampled so that the network can access enough context to learn an effective semantic segmentation model while still fitting into memory. Our network model is inspired by the fully convolutional 3D U-Net architecture proposed by Çiçek et al. [2].

The 3D U-Net architecture is based on the U-Net proposed in [14] and consists of analysis and synthesis paths with four resolution levels each. It utilizes deconvolutions [8] (also called transposed convolutions) to remap the lower-resolution, more abstract feature maps within the network to the denser space of the input images. This operation allows for efficient dense voxel-to-voxel predictions. Each resolution level in the analysis path contains two \(3 \times 3 \times 3\) convolutional layers, each followed by rectified linear units (ReLU), and a \(2 \times 2 \times 2\) max pooling with strides of two in each dimension. In the synthesis path, each max pooling operation is replaced by a \(2 \times 2 \times 2\) deconvolution with strides of two in each dimension, followed by two \(3 \times 3 \times 3\) convolutions, each followed by ReLU activations. Furthermore, 3D U-Net employs shortcut (or skip) connections from layers of equal resolution in the analysis path to provide higher-resolution features to the synthesis path [2]. The last layer contains a \(1\times 1\times 1\) convolution that reduces the number of output channels to the number of class labels K. This architecture has over 19 million learnable parameters and can be trained to minimize the average Dice loss derived from the binary case in [1]:

$$\begin{aligned} \mathcal {L}\left( X,\varTheta ,L\right) =- \frac{1}{K}\sum _{k=1}^K \left( \frac{2\sum _{i}^{N} p_{i,k}l_{i,k}}{\sum _{i}^{N} p_{i,k}+\sum _{i}^{N} l_{i,k}} \right) . \end{aligned}$$
(1)

Here, \(p_{i,k} \in [0,1]\) represents the continuous values of the softmax 3D prediction maps for each class label k of K, and \(l_{i,k}\) the corresponding ground truth value in L at each of the N voxels i.
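For concreteness, a minimal TensorFlow sketch of this average soft Dice loss is given below. The function name, the trailing-channel tensor layout, and the small epsilon added for numerical stability are our own assumptions and not part of Eq. (1).

```python
import tensorflow as tf

def average_dice_loss(y_true, y_pred, eps=1e-7):
    """Average soft Dice loss over K classes, following Eq. (1).

    y_true: one-hot ground truth, shape (batch, z, y, x, K).
    y_pred: softmax probabilities, same shape.
    """
    # Sum over the batch and the three spatial dimensions (all voxels i).
    axes = (0, 1, 2, 3)
    intersection = tf.reduce_sum(y_pred * y_true, axis=axes)
    denominator = tf.reduce_sum(y_pred, axis=axes) + tf.reduce_sum(y_true, axis=axes)
    dice_per_class = (2.0 * intersection) / (denominator + eps)   # shape (K,)
    # Negative mean Dice over the K classes.
    return -tf.reduce_mean(dice_per_class)
```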

2.2 Multi-scale Auto-Context Pyramid Approach

To effectively process an image at higher resolutions, we propose a method inspired by the auto-context algorithm [11]. Our method both captures contextual information from downsampled, lower-resolution images and learns more accurate segmentations from higher-resolution images, using two levels of a scale-space pyramid \(\mathbf {F} = \left\{ \left( f_s(X_s,\varTheta _s)\right) ,\ s = 1,\dots ,S\right\} \), with S being the number of levels s in our multi-scale pyramid and \(X_s\) being one of the multi-scale input subvolumes at each level s.

Fig. 2. Axial CT images and 3D surface rendering with ground truth (g.t.) and predictions overlaid. We show the two scales used in our experiments. Each scale's input is of size \(64\times 64\times 64\) in this setting.

In the first level, the 3D FCN is trained on images of the lowest resolution in order to capture the largest amount of context, downsampled by a factor of \(ds_1=2S\) and optimized using the Dice loss \(\mathcal {L}_1\). This can be thought of as a form of deep supervision [15]. In the next level, we use the predicted segmentation maps as a second input channel to the 3D FCN while learning from the images at a higher resolution, downsampled by a factor of \(ds_2 = ds_1/2\), and optimized using the Dice loss \(\mathcal {L}_2\). For input to this second level of the pyramid, the previous level's prediction maps are upsampled by a factor of 2 and cropped so that they spatially align with the higher-resolution subvolumes; they are then fed together with the appropriately cropped image data as a second channel. This approach can be learned end-to-end using modern multi-GPU devices and deep learning frameworks, with the total loss being \( \mathcal {L}_\mathrm {total} = \sum ^S_{s=1}\mathcal {L}_s\left( X_s,\varTheta _s,L_s\right) \). This idea is shown schematically in Fig. 1, and the two image scales used as input in our experiments (see the sketch below) are illustrated in Fig. 2.
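The wiring of the two-level pyramid can be sketched in Keras as follows. The `build_level_fcn` helper is only a stand-in for the 3D U-Net-like model of Sect. 2.1, `average_dice_loss` refers to the sketch after Eq. (1), and the filter counts, fixed \(64^3\) patch size, and centered-crop alignment (cropping 32 voxels per side after upsampling) are illustrative assumptions rather than the exact published configuration.

```python
from tensorflow.keras import layers, Model

K_CLASSES = 8            # seven organs + background
PATCH = (64, 64, 64)     # fixed input size at every pyramid level

def build_level_fcn(n_in_channels, name):
    """Placeholder for a 3D U-Net-like FCN returning K softmax maps.

    Any fully convolutional 3D model with matching input/output
    spatial size could be plugged in here.
    """
    inp = layers.Input(PATCH + (n_in_channels,))
    x = layers.Conv3D(32, 3, padding="same", activation="relu")(inp)
    x = layers.Conv3D(32, 3, padding="same", activation="relu")(x)
    out = layers.Conv3D(K_CLASSES, 1, activation="softmax")(x)
    return Model(inp, out, name=name)

# Level 1: CT patch downsampled by ds_1 = 4, image channel only.
x1 = layers.Input(PATCH + (1,), name="ct_level1")
level1 = build_level_fcn(n_in_channels=1, name="fcn_level1")
p1 = level1(x1)                                   # (64, 64, 64, K)

# Upsample the level-1 prediction by 2 and crop its central region so that
# it spatially aligns with the level-2 patch (centered-patch assumption).
p1_up = layers.UpSampling3D(size=2)(p1)           # (128, 128, 128, K)
p1_crop = layers.Cropping3D(cropping=32)(p1_up)   # (64, 64, 64, K)

# Level 2: CT patch downsampled by ds_2 = 2, concatenated with the
# auto-context channels from level 1.
x2 = layers.Input(PATCH + (1,), name="ct_level2")
level2_in = layers.concatenate([x2, p1_crop])
level2 = build_level_fcn(n_in_channels=1 + K_CLASSES, name="fcn_level2")
p2 = level2(level2_in)

# End-to-end model: the total loss is the sum of the per-level Dice losses.
pyramid = Model(inputs=[x1, x2], outputs=[p1, p2])
pyramid.compile(optimizer="adam",
                loss=[average_dice_loss, average_dice_loss],
                loss_weights=[1.0, 1.0])
```

Because both levels are part of one Keras model, gradients from \(\mathcal {L}_\mathrm {total}\) flow through the upsample/crop/concatenate path, which is what enables the end-to-end training described above.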

2.3 Implementation and Training

We implement our approach in Keras with the TensorFlow backend. The Dice loss (Eq. 1) is used for optimization with the Adam optimizer, using automatic differentiation for gradient computation. Batch normalization layers are inserted throughout the network, and we use a mini-batch size of three subvolumes sampled from different CT volumes of the training set. During training, we randomly extract subvolumes of fixed size such that at least one foreground voxel is at the center of each subvolume. On-the-fly data augmentation is applied via random translations, rotations, and elastic deformations, similar to [2].
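A minimal NumPy sketch of the foreground-centered subvolume sampling described above is shown below. The function name, array layout, and \(64^3\) patch size are assumptions; it also assumes at least one foreground voxel lies far enough from the volume borders, and in practice a matching lower-resolution patch would be extracted for level 1 and the augmentations applied on the fly.

```python
import numpy as np

def sample_subvolume(volume, labels, patch=(64, 64, 64), rng=np.random):
    """Randomly crop a fixed-size subvolume whose center voxel is foreground.

    volume, labels: 3D arrays of identical shape (z, y, x).
    Returns the image patch and its label patch.
    """
    half = np.array(patch) // 2
    # Candidate centers: foreground voxels far enough from the borders
    # to allow a full patch around them.
    fg = np.argwhere(labels > 0)
    valid = fg[np.all((fg >= half) & (fg + half <= labels.shape), axis=1)]
    center = valid[rng.randint(len(valid))]

    lo = center - half
    hi = lo + np.array(patch)
    sl = tuple(slice(l, h) for l, h in zip(lo, hi))
    return volume[sl], labels[sl]
```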

3 Experiments and Results

In our implementation, a constant input and output size of \(64\times 64\times 64\) randomly cropped subvolumes is used for training at each level. For inference, we employ network reshaping [8] to process the test images more efficiently with a larger input size, building up the full volume in a tiling approach [2]. The resulting segmentation masks for both levels are shown in Fig. 3: the second-level auto-context network markedly outperforms the first-level predictions and segments structures, especially the vessels, with improved detail. All experiments were performed on a DeepLearning BOX (GDEP Advance) with four NVIDIA Quadro P6000 GPUs with 24 GB of memory each. Training for 20,000 iterations with this unoptimized implementation took several days, while inference on a full CT scan takes only a few minutes on a single GPU.
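A simplified sketch of such tiling inference is given below. It assumes non-overlapping tiles, a volume already resampled and padded to a multiple of the tile size, and a model whose output tile matches its input tile; in practice the network is reshaped to a larger input size [8] and overlapping tiles with border handling would typically be used.

```python
import numpy as np

def predict_tiled(model_predict, volume, tile=(64, 64, 64), n_classes=8):
    """Run inference tile by tile and stitch the class probabilities.

    model_predict: callable mapping a (1, tz, ty, tx, 1) patch to
                   (1, tz, ty, tx, n_classes) softmax probabilities.
    volume: 3D CT array, padded to a multiple of the tile size.
    """
    out = np.zeros(volume.shape + (n_classes,), dtype=np.float32)
    tz, ty, tx = tile
    for z in range(0, volume.shape[0], tz):
        for y in range(0, volume.shape[1], ty):
            for x in range(0, volume.shape[2], tx):
                patch = volume[z:z+tz, y:y+ty, x:x+tx]
                pred = model_predict(patch[np.newaxis, ..., np.newaxis])
                out[z:z+tz, y:y+ty, x:x+tx] = pred[0]
    return out.argmax(axis=-1)   # final label map
```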

Fig. 3. Axial CT images and 3D surface rendering of predictions from the two multi-scale levels in comparison with ground truth annotations. In particular, the vessels are segmented more completely and in greater detail in the second level, which utilizes auto-context information in its prediction.

Data: Our dataset includes 377 contrast-enhanced clinical CT images of the abdomen in the portal-venous phase, used for pre-operative planning in gastric surgery. Each CT volume consists of 460–1,177 slices of 512\(\times \)512 pixels, with voxel dimensions of \([0.59{-}0.98, 0.59{-}0.98, 0.5{-}1.0]\) mm. With \(S=2\), we downsample each volume by a factor of \(ds_1=4\) in the first level and a factor of \(ds_2=2\) in the second level. A random 90%/10% split of 340/37 patients is used for training and testing the network. Each \(L_n\) contains \(K=8\) labels, consisting of manual annotations of seven anatomical structures (artery, portal vein, liver, spleen, stomach, gallbladder, pancreas) plus background. The Dice similarity scores achieved for each labeled organ in the testing cases are summarized in Table 1. We list the performance of the first-level model and of the second-level model with auto-context trained either separately or end-to-end, and compare against a second level trained without auto-context. This isolates the effect of the lower-resolution auto-context channel at the higher-resolution input, with all second-level variants trained from scratch on the same input resolution.
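As an illustration of this preprocessing, the two pyramid inputs could be generated with a resampling step along the following lines; the helper name and the choice of interpolation orders (linear for images, nearest neighbor for labels) are our assumptions.

```python
from scipy.ndimage import zoom

def make_pyramid_inputs(volume, labels, ds1=4, ds2=2):
    """Downsample a CT volume and its label map for the two pyramid levels (S=2)."""
    level1_img = zoom(volume, 1.0 / ds1, order=1)   # coarse image, factor 4
    level2_img = zoom(volume, 1.0 / ds2, order=1)   # finer image, factor 2
    level1_lab = zoom(labels, 1.0 / ds1, order=0)   # nearest neighbor for labels
    level2_lab = zoom(labels, 1.0 / ds2, order=0)
    return (level1_img, level1_lab), (level2_img, level2_lab)
```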

Table 2 compares our results to recent literature and also reports results on an unseen testing dataset from a different hospital, consisting of 129 cases from a distinct research study. Furthermore, we test our model on a public dataset of 20 contrast-enhanced CT scans.

Table 1. Comparison of different levels of our model. End-to-end training gives a statistically significant improvement (\(p<0.001\)).
Table 2. We compare our model trained in an end-to-end fashion to recent work on multi-organ segmentation. [9] uses a 2D FCN approach with a majority voting scheme, while [7] employs 3D FCN architectures. Furthermore, we list our performance on an unseen testing dataset from a different hospital and on the public Visceral dataset without any re-training, and compare it to the current challenge leaderboard (LB) best performance for each organ. Note that this table is not comprehensive, and direct comparison to the literature is always difficult due to the different datasets and evaluation schemes involved.

4 Discussion and Conclusion

The multi-scale auto-context approach presented in this paper provides a simple yet effective method for employing 3D FCNs in medical imaging settings. No post-processing was applied to any of the network outputs. Our approach improves performance for all tested organs, apart from the gallbladder, where the differences are not significant. Note that we used different datasets (from different hospitals and scanners) for separate testing. These experiments illustrate our method's generalizability and robustness to differences in image quality and patient populations. Running the algorithm at a quarter to half of the original resolution improved both performance and efficiency in this application. While the method could be extended to a multi-scale pyramid whose final level operates at the original resolution, we found that the added computational burden did not significantly improve segmentation performance. The main improvement comes from utilizing a very coarse image (downsampled by a factor of four) in an effective manner. In this work, we utilized a 3D U-Net-like model for each level of the image pyramid. However, the proposed auto-context approach should in principle also work with other 3D CNN/FCN architectures and with other 2D and 3D image modalities.

In conclusion, we showed that an auto-context approach can improve semantic segmentation results for 3D FCNs based on the 3D U-Net architecture. While the low-resolution part of the model benefits from a larger context in the input image, the higher-resolution auto-context part of the model can segment the image in greater detail, resulting in better overall dense predictions. Training both levels end-to-end resulted in improved performance.