1 Introduction

Abdominal multi-organ segmentation is of great significance in medical diagnosis and research. Through pixel-level segmentation of CT or MRI scans and three-dimensional reconstruction of the segmentation results, doctors can obtain more intuitive information about patients’ abdominal organs [3, 4, 10, 17, 20]. In recent years, automatic medical image segmentation algorithms have made great breakthroughs, and methods based on deep learning have achieved excellent performance in this task [9, 12, 18, 19]. Deep learning techniques based on neural networks enable fast segmentation and effectively address the low accuracy and long running time of traditional image segmentation [8, 15]. Recent research mainly focuses on the network structure and the segmentation framework. At present, the most widely used network structures are encoder-decoder architectures similar to U-Net [14], such as 3D U-Net [1] and V-Net [13], while nnU-Net [7] has achieved excellent results as a segmentation framework. For example, in the MICCAI 2019 KiTS19 challenge, the accuracy of nnU-Net with a 3D U-Net backbone on the kidney segmentation task was very close to human performance, while the time required to complete a segmentation was far less than that of manual segmentation. Deep learning-based methods thus not only surpass traditional algorithms but also approach the accuracy of manual segmentation. However, previously published methods are difficult to deploy on low-configuration devices.

In this paper, we propose a stable three-stage automatic segmentation method for 13 abdominal organs: liver, spleen, pancreas, right kidney, left kidney, stomach, gallbladder, esophagus, aorta, inferior vena cava, right adrenal gland, left adrenal gland, and duodenum. Our method completes the whole segmentation task, including preprocessing the CT data, segmenting the multiple organs, and post-processing the segmentation outputs, with low GPU memory occupation.

2 Methods

2.1 Preprocessing

In the preprocessing stage, we first standardize the voxel spacing of the CT scans. Due to the limited amount of available GPU memory, the patch size that can be processed by 3D CNNs is typically quite small. The target spacing, which directly determines the total size of the images in voxels, therefore also determines how much contextual information the CNN can capture within its patch size. We resample all the data to a voxel spacing of \(4.4 \times 2.5 \times 2.5\) mm for the first stage and \(3.0 \times 1.6 \times 1.6\) mm for the second stage. After spacing standardization, we cap the in-plane resolution at \(128 \times 176\) pixels for the first stage and \(230 \times 300\) pixels for the second stage, so that data with a high original spacing do not become too large after standardization, which would significantly increase segmentation time.
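A minimal sketch of this spacing standardization, assuming SimpleITK as the I/O and resampling library (the paper does not name its preprocessing toolkit). Spacings are written as (z, y, x) in mm to match the text; the file name is hypothetical.

```python
import SimpleITK as sitk

def resample_to_spacing(image: sitk.Image, target_spacing_zyx=(4.4, 2.5, 2.5)):
    """Resample a CT volume to the target voxel spacing with linear interpolation."""
    # SimpleITK stores spacing as (x, y, z), so reverse the (z, y, x) order.
    new_spacing = tuple(reversed(target_spacing_zyx))
    old_spacing = image.GetSpacing()
    old_size = image.GetSize()
    # Keep the physical extent constant: new_size * new_spacing = old_size * old_spacing.
    new_size = [int(round(osz * osp / nsp))
                for osz, osp, nsp in zip(old_size, old_spacing, new_spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                         image.GetOrigin(), new_spacing, image.GetDirection(),
                         0, image.GetPixelID())

ct = sitk.ReadImage("case_0001.nii.gz")          # hypothetical file name
ct_stage1 = resample_to_spacing(ct, (4.4, 2.5, 2.5))  # first-stage spacing
ct_stage2 = resample_to_spacing(ct, (3.0, 1.6, 1.6))  # second-stage spacing
```

The in-plane resolution cap (\(128 \times 176\) or \(230 \times 300\) pixels) would then be enforced by cropping or further downsampling the resampled volume before inference.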

Fig. 1. The network architecture. Gray cuboids represent feature maps. The number of channels is denoted next to the feature map.

2.2 Proposed Method

To verify the impact of the segmentation pipeline strategy on the results, we used an improved 3D U-Net as the segmentation network. The network architecture is illustrated in Fig. 1. The network consists of an encoding path and a decoding path, each with four resolution levels. Each level of the encoding path contains two \(3 \times 3 \times 3\) convolution layers, each followed by a ReLU layer, and a \(2 \times 2 \times 2\) max pooling layer with stride 2. In the decoding path, each level also contains two \(3 \times 3 \times 3\) convolution layers, each followed by a ReLU layer, and an upsampling layer. The sum of the Dice loss and the cross-entropy loss is used as the loss function. We used adaptive moment estimation (Adam) as the optimizer with a batch size of 2. The networks were initialized using Kaiming normal initialization. We set the initial learning rate to 1e-3 and multiplied it by 0.99 every 5 epochs until it reached 1e-6.
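A minimal PyTorch sketch of this setup, assuming PyTorch as the framework. The encoder-level module and the soft-Dice formulation are illustrative assumptions; only the hyper-parameters (Dice + cross-entropy sum, Adam, Kaiming normal initialization, and the learning-rate schedule) are taken from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderLevel(nn.Module):
    """One encoding-path level: two 3x3x3 convs, each followed by ReLU,
    then 2x2x2 max pooling with stride 2."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.block(x)
        return self.pool(skip), skip  # pooled output and skip connection

def dice_loss(logits, target, eps=1e-5):
    """Soft Dice loss averaged over classes; `target` holds integer labels."""
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])
    one_hot = one_hot.permute(0, 4, 1, 2, 3).float()  # to N,C,D,H,W
    dims = (0, 2, 3, 4)
    inter = (probs * one_hot).sum(dims)
    denom = probs.sum(dims) + one_hot.sum(dims)
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def total_loss(logits, target):
    # Sum of Dice loss and cross-entropy loss, as described in the text.
    return dice_loss(logits, target) + F.cross_entropy(logits, target)

model = EncoderLevel(1, 32)  # stand-in; the full network follows Fig. 1
for m in model.modules():
    if isinstance(m, nn.Conv3d):
        nn.init.kaiming_normal_(m.weight)  # Kaiming normal initialization

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Multiply the learning rate by 0.99 every 5 epochs, floored at 1e-6.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.99)
```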

The pipeline of our method consists of three stages: global locating, organ locating, and organ segmentation. Each stage generates a segmentation result for the complete CT, and the second and third stages operate on the result of the previous stage. As shown in Fig. 2, in the global locating stage we first cut the original CT into several ROIs and then segment each ROI with the first trained network. In the organ locating stage, we locate the region of the abdominal organs in the whole CT according to the result of the first stage, save this region at a higher resolution, and segment it with the second trained network. In the organ segmentation stage, we locate and crop each organ according to the result of the second stage and then use the corresponding network to finely segment each organ (see the sketch after Fig. 2). Finally, we paste the segmentation result of each organ back to its corresponding position to generate the final segmentation result.

Fig. 2. The pipeline of our proposed method.
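A minimal, runnable sketch of the locate-and-crop step shared by stages 2 and 3: derive a bounding box from the previous stage's mask and crop the CT around it. The function names and the margin size are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bounding_box(mask: np.ndarray, label: int, margin: int = 8):
    """Bounding-box slices of `label` in `mask`, padded by `margin` voxels."""
    coords = np.argwhere(mask == label)
    if coords.size == 0:
        raise ValueError(f"label {label} not present in mask")
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return tuple(slice(int(l), int(h)) for l, h in zip(lo, hi))

def crop_to_label(ct: np.ndarray, mask: np.ndarray, label: int):
    box = bounding_box(mask, label)
    return ct[box], box  # keep `box` to paste the fine result back later
```

Stage 3 segments each organ crop with its dedicated network and pastes the result into a full-size output volume at the same `box` location.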

To further improve the robustness of the network on diverse data, we adopt a semi-supervised training strategy. Since no study has shown that more unlabeled data is always better in semi-supervised learning, we use roughly as much unlabeled data as labeled data. Specifically, we use 40 labeled scans and 50 randomly selected unlabeled scans as the training set for the global locating and organ locating stages. We train the model on the labeled data for the first 50 epochs and then introduce the unlabeled data: every five epochs, we use the current model to segment the unlabeled scans and use the resulting pseudo-labels for training. As the first two stages segment the complete CT, which differs from the third stage, we apply the semi-supervised strategy only to the first two stages.
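A schematic sketch of this pseudo-labeling schedule. `train_epoch` and `predict` are hypothetical callables (one optimization epoch and full-volume inference, respectively), and the total epoch count is an assumption; only the schedule itself (50 warm-up epochs on labeled data, pseudo-label refresh every five epochs) is taken from the text.

```python
def semi_supervised_training(model, labeled, unlabeled,
                             train_epoch, predict, total_epochs=300):
    pseudo = []  # (image, pseudo-label) pairs generated by the current model
    for epoch in range(total_epochs):
        if epoch >= 50 and (epoch - 50) % 5 == 0:
            # Re-segment the unlabeled scans and use the results as labels.
            pseudo = [(img, predict(model, img)) for img in unlabeled]
        train_epoch(model, labeled + pseudo)
    return model
```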

2.3 Post-processing

In the post-processing stage, we assemble the network segmentation results. For each organ, we keep only the connected component with the largest volume and remove the rest, which eliminates isolated, incorrectly predicted labels. To improve segmentation efficiency, we clear the cache and delete the used feature maps and models from the GPU after each stage. As a result, the maximum GPU memory our method uses is 1953 MB.
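A minimal sketch of these two steps: keeping the largest connected component per organ label (scipy is an assumption; the paper does not name the library) and releasing GPU memory between stages.

```python
import numpy as np
import torch
from scipy import ndimage

def keep_largest_component(mask: np.ndarray, label: int) -> np.ndarray:
    """Remove all but the largest connected component of `label` in `mask`."""
    binary = mask == label
    components, n = ndimage.label(binary)
    if n <= 1:
        return mask
    sizes = ndimage.sum(binary, components, range(1, n + 1))
    keep = int(np.argmax(sizes)) + 1
    mask[binary & (components != keep)] = 0  # drop isolated false positives
    return mask

def release_gpu(model):
    """Delete the stage's model and clear cached GPU memory."""
    del model
    torch.cuda.empty_cache()
```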

3 Experiments

3.1 Dataset and Evaluation Measures

The FLARE2022 dataset is curated from more than 20 medical groups under license permission, including MSD [16], KiTS [5, 6], AbdomenCT-1K [11], and TCIA [2]. The training set includes 50 labeled CT scans with pancreas disease and 2000 unlabeled CT scans with liver, kidney, spleen, or pancreas diseases. The validation set includes 50 CT scans with liver, kidney, spleen, or pancreas diseases. The testing set includes 200 CT scans, of which 100 cases have liver, kidney, spleen, or pancreas diseases and the other 100 cases have uterine corpus endometrial, urothelial bladder, stomach, sarcoma, or ovarian diseases. All the CT scans contain only image information; the originating center information is not available.

The evaluation measures consist of two accuracy measures: Dice Similarity Coefficient (DSC) and Normalized Surface Dice (NSD), and three running efficiency measures: running time, area under GPU memory-time curve, and area under CPU utilization-time curve. All measures will be used to compute the ranking. Moreover, the GPU memory consumption has a 2 GB tolerance.

3.2 Implementation Details

Environment Settings. The environments and requirements are presented in Table 1.

Table 1. Environments and requirements.

Training Protocols. The training protocols are presented in Table 2.

Table 2. Training protocols.

3.3 Resource Consumption

The resource consumption during inference is presented in Table 3.

Table 3. Resource consumption during inference.

4 Results and Discussion

As the accuracy metric, we employed the average DSC between the predicted mask and the ground-truth mask. Assuming A and B are two masks, the metric is given by (1).

$$\begin{aligned} \textrm{DSC}=\frac{2|A \cap B|}{|A|+|B|} \end{aligned}$$
(1)
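A direct NumPy implementation of Eq. (1) for two binary masks; the small epsilon guarding against empty masks is an added assumption.

```python
import numpy as np

def dsc(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Dice Similarity Coefficient between two binary masks, per Eq. (1)."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + eps)
```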

4.1 Quantitative Results on Validation Set

Table 4 compares the segmentation results on the 13 organs across the three stages. In the global locating, organ locating, and organ segmentation stages, our method achieves average DSCs of 0.63, 0.73, and 0.77, respectively. The highest DSC among the three stages is highlighted in Table 4. It is apparent from this table that the DSC results in stage 3 are significantly higher than in the previous stages.

Table 4. Comparison of the 13 structures on the official validation and testing results.

As the models used in the first and second stages were trained semi-supervised with unlabeled data, we also tested the effect of the unlabeled data. Table 5 shows the DSC comparison of our method with and without unlabeled data. It can be observed that using unlabeled data improves the accuracy of our method.

Table 5. Comparison of our method with and without using unlabeled data.

4.2 Qualitative Results on Validation Set

Figure 3 shows three examples with good segmentation results on CT slices from the validation set. Figure 4 shows voxel-based renderings of three examples from the validation set. In these results, the performance of our method is generally stable.

Fig. 3. Three examples with good segmentation results on CT slices.

Fig. 4. Segmentation results with voxel-based rendering from three examples in the evaluation dataset. For each example, the ground truth and the segmentation results are given for visual comparison.

As shown in Fig. 5, there are also examples with bad segmentation results on CT slices in the validation set. In the first case, part of the right kidney tumor and the pancreas were not correctly recognized. This is because the training set contains little data with kidney tumors, and the boundary between the pancreas and the surrounding tissue is not particularly distinct. In the second case, our method performs poorly on the spleen and stomach. The gray value of the stomach is abnormally high in the CT image, which not only led to the stomach being recognized incorrectly but also overrode the correct label of the spleen. In the third case, a typical liver recognition error occurred: because such features are rare in the training data, the network habitually takes the lung boundary as the criterion for judging the liver region.

Fig. 5. Three examples with bad segmentation results on CT slices.

5 Conclusion

We propose a three-stage automatic segmentation method for 13 abdominal organs based on an improved 3D U-Net. The average DSC of our method is 0.77 on the official validation leaderboard, and the results show that our method is more accurate on large organs than on small ones. The three-stage method is fast, but it is difficult to achieve higher accuracy due to the limitation on feature map size. Future work will focus on improving accuracy with methods using fewer stages, in which segmentation speed can be further improved.