1 Introduction

With its high resolution and low cost, CT scanning is critical in clinical decision making and holds the key to making precise medical care accessible to everyone around the world. Recently, deep learning methods have been introduced to detect lesions in CT slices  [1,2,3,4,5]. Since lesions are difficult to distinguish within a single axial slice, exploiting sufficient 3D context for accurate detection in volumetric CT data has emerged as a significant research focus.

Various architectures have been proposed to properly model 3D context from neighboring CT slices. Yan et al.  [1] adopt a late fusion strategy that stacks 2D features of neighboring slices to build 3D-context-enhanced features. Although this pseudo-3D contextual information has provided a prominent performance gain  [1,2,3,4,5], the late fusion strategy loses considerable context information from the early stages of the network. A direct way to address this issue is to employ 3D convolutions, which introduce inter-slice connections hierarchically to learn 3D representations end to end. 3D convolutional filters can preserve 3D structure and texture information well, but their intensive memory and computation demands hinder their wide application to the universal lesion detection problem. Worse still, although 3D network pre-training has attracted significant research attention  [6,7,8,9], the lack of good pre-trained 3D models makes it even harder to achieve good performance with 3D-based detectors.

In this paper, we focus on the problem of universal lesion detection in CT slices, where multiple adjacent CT slices are taken into consideration to localize 2D lesions in the target slice. We aim to develop a generic and efficient 3D backbone for 2D lesion detection with enhanced context modeling across multiple CT slices, and to devise a supervised pre-training method to boost its performance. Specifically, pseudo-3D convolutional filters  [8], which factorize the 3D convolution into separable spatial and inter-slice convolutions, are adopted to reduce the memory and computation overhead. The backbone of our method is a Modified Pseudo-3D ResNet (MP3D ResNet), which extracts context-enhanced 3D features from multiple neighboring CT slices (9 in our case) and then converts the 3D features into 2D ones with a group transform module (GTM) for subsequent 2D lesion detection in the target slice. We then feed the backbone features extracted by the MP3D ResNet into a Feature Pyramid Network (FPN) neck, forming the MP3D FPN for effective multi-scale detection. Finally, to facilitate efficient training of the MP3D FPN, we design a novel supervised pre-training method, which exploits supervised signals from a large-scale 2D natural-image object detection dataset to pre-train the proposed MP3D detector. In summary, the main contributions of our paper are threefold:

  • 1. We propose a generic framework that employs a 3D network for 2D lesion detection in CT slices. The proposed MP3D FPN is computationally and memory efficient, and it achieves state-of-the-art performance on the DeepLesion dataset.

  • 2. We derive a novel and effective way to use 2D natural images to pre-train a 3D network with supervised labels; the pre-trained weights can potentially benefit other 3D medical image analysis tasks (e.g. segmentation).

  • 3. We conduct comprehensive experiments to explore the effects of pre-trained weights for deep-learning-based medical image analysis. The results suggest that pre-trained weights not only lead to faster convergence on datasets of all sizes, but also help achieve better results on smaller ones.

2 Methodology

Figure 1 gives an overview of the proposed lesion detection framework. The proposed MP3D FPN comprises an MP3D ResNet as the backbone, a 2D FPN  [12] as the neck, and a 2D RPN/RCNN head. The MP3D ResNet takes multiple consecutive CT slices (e.g. 9) as input and generates 3D feature maps that carry 3D context. A conversion block (GTM) then transforms the 3D feature maps into 2D ones for subsequent 2D detection. Detailed architecture designs of the proposed MP3D backbone and the novel supervised pre-training scheme are elaborated in the following sections.

Fig. 1. Overview of the proposed MP3D FPN. MP3D ResNet extracts context-enhanced 3D features and converts them to 2D ones with a group transform module (GTM). These context-enhanced 2D features are then fed into the FPN neck and the RPN/RCNN head for 2D lesion detection. The MP3D FPN is pre-trained on the Microsoft COCO object detection dataset  [15].

2.1 3D Context Modeling with an MP3D ResNet Backbone

In this work, we explore employing 3D convolutions for effective 3D context modeling in lesion detection from consecutive CT slices (e.g. 9 slices). To improve the time and memory efficiency over a standard 3D ResNet, we adopt the Pseudo-3D Residual Network (P3D ResNet)  [8] as the prototype of our backbone network. The pseudo-3D convolution simulates a \(3\times 3\times 3\) convolution with a \(1\times 3\times 3\) filter on axial-view slices plus a \(3\times 1\times 1\) filter that builds inter-slice connections across adjacent CT slices.
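
To make the factorization concrete, the following PyTorch sketch implements such a pseudo-3D convolution; the serial (P3D-A-style) ordering, the module names, and the in-plane-only stride are our illustrative choices, the latter anticipating the MP3D modification described below.

```python
import torch
import torch.nn as nn

class P3DConv(nn.Module):
    """Pseudo-3D convolution: a 1x3x3 conv over axial slices followed by a
    3x1x1 conv that builds inter-slice connections (serial, P3D-A-style)."""

    def __init__(self, in_ch: int, out_ch: int, spatial_stride: int = 1):
        super().__init__()
        # Stride is applied in-plane only; the inter-slice (depth) dimension
        # keeps its resolution, as required by the MP3D design.
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 stride=(1, spatial_stride, spatial_stride),
                                 padding=(0, 1, 1), bias=False)
        self.inter_slice = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                     stride=1, padding=(1, 0, 0), bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, D, H, W), e.g. D = 9 consecutive CT slices.
        return self.inter_slice(self.spatial(x))
```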

Lesion detection in CT slices aims to predict 2D bounding boxes in a given slice, so it requires 2D feature maps corresponding to the target slice for prediction. We therefore need to convert the 3D feature maps into 2D ones while preserving the precise information of the target CT slice for accurate localization and classification. The proposed Modified Pseudo-3D Residual Network (MP3D ResNet) features two modifications to fulfill these demands: 1) instead of the isotropic pooling of the original P3D ResNet, we omit pooling in the inter-slice dimension; 2) a group transform module generates the desired 2D feature maps from the context-enhanced 3D features.

Omitting pooling in the inter-slice dimension helps preserve precise information about the target slice. Meanwhile, since the number of input slices (e.g. 9) is rather small, a sufficient receptive field in the inter-slice dimension can be obtained without downsampling. Regarding the 2D feature map conversion, Fang et al.  [13] proposed to extract the C 2D feature maps (\(1\times 1\times H\times W\)) corresponding to the center slice and concatenate them to form a converted 2D feature map of size \(C\times H\times W\). However, this method cannot fully exploit the 3D context information residing in the adjacent slices.

We instead propose a group transform module (GTM) that includes all slices' features to compensate for this information loss. Specifically, we reshape the 3D features (\(C\times D\times H\times W\)) into 2D ones (\(CD\times H\times W\)) and apply a group convolutional layer with C groups (every D channels form a group) to fuse the features of all neighboring slices into the final 2D feature maps (\(C\times H\times W\)).
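
A minimal sketch of this conversion is given below; the module name and the \(1\times 1\) kernel of the group convolution are assumptions, as the text only specifies the reshape and the grouping scheme.

```python
import torch
import torch.nn as nn

class GroupTransformModule(nn.Module):
    """Converts a (C, D, H, W) 3D feature map into a (C, H, W) 2D one by
    fusing, per channel, the features of all D slices with a group conv."""

    def __init__(self, channels: int, depth: int):
        super().__init__()
        # C groups: each group fuses the D depth slices of one channel.
        self.fuse = nn.Conv2d(channels * depth, channels, kernel_size=1,
                              groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, d, h, w = x.shape           # (N, C, D, H, W)
        x = x.reshape(n, c * d, h, w)     # stack depth into channels
        return self.fuse(x)               # (N, C, H, W)
```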

Fig. 2. Comparison between 1) using 2D ImageNet pre-trained weights for multi-slice medical image analysis and 2) decomposing a 2D natural image to simulate multi-slice medical images for 3D network pre-training.

Table 1. Sensitivities (%) at various FPs per image on the test set of DeepLesion. \(\mathbf {^*}\) indicates our re-implementation of 3DCE using a ResNet-50 FPN with the same configuration as our MP3D FPN.

2.2 Supervised 3D Pre-training with COCO Dataset

Supervised pre-training on natural images has proven to be an effective form of transfer learning for 2D medical images [1,2,3,4,5,10], indicating that supervised pre-trained models from another domain can benefit medical image analysis. Moreover, compared to self-supervised signals, we believe that supervised labels, which carry semantic information, enable the model to learn semantically invariant and discriminative features more effectively. Therefore, in this section, we develop a method that exploits supervised labels from a large-scale 2D natural-image object detection dataset (e.g. COCO  [15]) to pre-train our MP3D FPN.

Previous works  [1] have shown that by grouping 3 consecutive CT slices (natively 3D data) as a 3-channel RGB image, detection performance can be boosted with ImageNet pre-trained weights, indicating the feasibility of simulating RGB natural images with natively 3D CT slices. This inspires us to reversely decompose the 3 channels of a natural RGB image into 3 consecutive "CT slices" and train an MP3D FPN on such simulated 3D data. Figure 2 illustrates a comparison of the two correlative strategies. For implementation, we train the MP3D FPN on the COCO dataset for 72 epochs and use the final weights to initialize the MP3D ResNet. To drive the network to learn useful 3D contextual features from inter-slice connections, it is essential to keep the resolution in the inter-slice dimension unchanged across all stages of the backbone. Although the MP3D detector is pre-trained with 3 slices, its weights can initialize lesion detectors that take a variable number of slices as input, since the backbone is fully convolutional along the slice dimension.
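
Implementation-wise, the decomposition amounts to reinterpreting the channel axis of an RGB batch as the slice (depth) axis; in the sketch below, mp3d_resnet is a hypothetical handle for a backbone that consumes single-channel volumes of shape (N, 1, D, H, W).

```python
import torch

def decompose_rgb_batch(images: torch.Tensor) -> torch.Tensor:
    """Reinterpret a (N, 3, H, W) RGB batch as (N, 1, 3, H, W) pseudo-volumes,
    so each color channel plays the role of one consecutive CT slice."""
    return images.unsqueeze(1)

# Usage (mp3d_resnet is a hypothetical backbone taking (N, C=1, D, H, W)):
# feats_3d = mp3d_resnet(decompose_rgb_batch(coco_batch))
```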

3 Experiments

3.1 Experimental Setup

Dataset and Metric: NIH DeepLesion is a large-scale dataset for lesion detection, containing 32,735 lesions on 32,120 axial CT slices from 4,427 patients. DeepLesion is split into training (70%), validation (15%), and test (15%) sets. We evaluate our MP3D FPN and all compared methods on the test set, reporting the mean average precision (MAP@0.5) and the average sensitivities at different numbers of false positives (FPs) per image.
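
For reference, the sketch below shows one simplified way to compute sensitivity at a given number of FPs per image from per-image detections; all names are illustrative, and details of the official DeepLesion evaluation (e.g. tie handling) are ignored.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def sensitivity_at_fps(dets, gts, fp_rates=(0.5, 1, 2, 4), iou_thr=0.5):
    """dets: {image_id: [(score, box), ...]}; gts: {image_id: [box, ...]},
    with gts covering every test image. Returns one sensitivity per rate."""
    n_images = len(gts)
    n_gt = sum(len(b) for b in gts.values())
    flat = sorted(((s, img, box) for img, ds in dets.items()
                   for s, box in ds), key=lambda t: -t[0])
    matched = {img: [False] * len(b) for img, b in gts.items()}
    tp_curve, fp_curve, tp, fp = [], [], 0, 0
    for _, img, box in flat:  # sweep detections from most to least confident
        ious = [0.0 if matched[img][j] else box_iou(box, g)
                for j, g in enumerate(gts.get(img, []))]
        j_best = int(np.argmax(ious)) if ious else -1
        if j_best >= 0 and ious[j_best] >= iou_thr:
            matched[img][j_best] = True
            tp += 1
        else:
            fp += 1
        tp_curve.append(tp)
        fp_curve.append(fp)
    # For each allowed FP budget, take the best sensitivity reached within it.
    return [max((t for t, f in zip(tp_curve, fp_curve)
                 if f <= r * n_images), default=0) / n_gt
            for r in fp_rates]
```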

Implementation Details: As in  [3], the Hounsfield units (HU) are clipped to the range \([-1024,1050]\). We interpolate along the z-axis to normalize the slice intervals of all CT volumes to 2.5 mm. Anchor scales are set to \(\{16, 32, 64, 128, 256\}\) in the FPN. Apart from horizontal and vertical flips, we resize the images to scales of \(\{448, 512, 576\}\) for data augmentation. MP3D-63 with group normalization [14], which has a similar depth to the ResNet3D-50 model, is used as the backbone in all our experiments. The MP3D-63 model is derived from the conventional P3D-63 [8] model with the proposed modifications. Unless otherwise specified, the MP3D FPN takes 9 consecutive slices as input. We train all models for 24 epochs with a base learning rate of 0.02, reduced by a factor of 10 after the 16th and 22nd epochs (corresponding to the 2x learning schedule [11] on the COCO dataset). We conduct experiments on an NVIDIA TITAN V GPU with 12 GB of memory, and mixed-precision training is used in all experiments to save memory.
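
A minimal preprocessing sketch following these settings (the function name and the scipy-based resampling are our assumptions):

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(volume_hu: np.ndarray, z_spacing_mm: float,
                      target_z_mm: float = 2.5) -> np.ndarray:
    """Clip HU values and resample slice spacing, per the settings above."""
    vol = np.clip(volume_hu, -1024, 1050).astype(np.float32)  # HU clipping
    # Linear interpolation along z so slice intervals become 2.5 mm.
    vol = zoom(vol, (z_spacing_mm / target_z_mm, 1.0, 1.0), order=1)
    return vol
```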

3.2 Comparison with State-of-the-arts

Table 1 presents comparisons with previous state-of-the-art (SOTA) methods, including 3DCE [1], MSB [2], RetinaNet [3], MVP-Net [4], and MULAN [5]. Our model surpasses all of them in sensitivity at every FP rate and in MAP@0.5.

Without using any auxiliary supervision, MP3D FPN outperforms MULAN, the previous SOTA, which additionally employs multi-task learning and a deeper backbone (DenseNet-121), by up to 3.48% in sensitivity at 0.5 FPs per image. For a fair comparison, we re-implement 3DCE with a ResNet-50 FPN using the same configuration as our MP3D FPN. Our proposed MP3D achieves a performance gain of 6.05% in MAP@0.5 over this 2D-convolution-based context encoding method, demonstrating the superior 3D context modeling ability of the MP3D backbone. As shown in Table 1, detectors based on MP3D FPN (248.93 GFLOPS, 45.16 M params) and MR3D FPN (Modified ResNet 3D; 415.81 GFLOPS, 64.03 M params) achieve comparable results, but the MP3D-based detector consumes much less time and memory, strongly demonstrating the efficacy and efficiency of the MP3D model.

3.3 Ablation Study

We perform a number of ablations to probe our MP3D FPN. The results are as follows:

Table 2. Detection performance and computational cost with varying numbers of input slices. GFLOPS is used to characterize the computational cost.

Input Slices: Table 2 shows the performance of the MP3D detector with 5, 7, 9, and 11 slices as input. The detector achieves higher detection accuracy as more slices are used, at the cost of more time and memory. MP3D with 7 input slices provides the best trade-off between effectiveness and efficiency.

Table 3. Comparison of different conversion modules and different pooling strategies for pre-training.

Conversion Type: Table 3 compares the proposed GTM with the center-cropping transform module (CTM) proposed by Fang et al.  [13]. The proposed GTM yields better results, as it efficiently aggregates information from all adjacent slices for detection.

3.4 Effectiveness of the 3D Pre-trained Model

We conduct three groups of experiments to explore the effectiveness of the pre-training method.

Comparison to Isotropic Pooling: To achieve 3D context modeling ability along the z-axis, we omit pooling in the inter-slice dimension when pre-training the MP3D model on the Microsoft COCO dataset. For validation, we compare the proposed method with isotropic pooling.

The pre-trained model takes three slices as input. When training with isotropic pooling, the z-axis degenerates to a single slice after the first two pooling layers, preventing subsequent 3D convolution layers from learning useful 3D contextual information. As shown in Table 3, pre-trained weights learned with isotropic pooling give worse results than the proposed method. This also shows that using decomposed natural images as input actually helps the 3D model gain context-encoding ability, so the learned weights can potentially boost the performance of other 3D medical image analysis tasks.
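
The collapse is easy to verify with an assumed ResNet-style stem: an isotropic stride-2 convolution followed by an isotropic stride-2 pooling already reduces a depth of 3 to 1.

```python
import torch
import torch.nn as nn

# Assumed ResNet-style stem with isotropic (stride-2) downsampling.
x = torch.zeros(1, 1, 3, 224, 224)  # a 3-slice pseudo-volume (N, C, D, H, W)
conv = nn.Conv3d(1, 64, kernel_size=7, stride=2, padding=3)
pool = nn.MaxPool3d(kernel_size=3, stride=2, padding=1)

print(conv(x).shape)        # torch.Size([1, 64, 2, 112, 112]): depth 3 -> 2
print(pool(conv(x)).shape)  # torch.Size([1, 64, 1, 56, 56]): depth collapses to 1
```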

Table 4. Comparison of model performance with and without pre-training under different learning schedules. 1x, 2x, and 6x indicate maximum training lengths of 12, 24, and 72 epochs, respectively.
Table 5. Training with varying dataset sizes (\(100\%\) to \(20\%\)). For simplicity, we report MAP@0.5.

Comparison to Training from Scratch: He et al. [11] demonstrated that with sufficient training data (around 35k images in their experiments) and a longer training schedule (6x), models trained from scratch can achieve results comparable to models initialized with pre-trained weights. We therefore examine the effectiveness of the proposed pre-training method by comparing MP3D with pre-training against models trained from scratch with longer schedules.

As shown in Table 4, when both are trained with the 1x learning schedule (12 epochs), MP3D with pre-trained weights significantly outperforms the one without pre-training, demonstrating faster convergence. Moreover, with the 2x learning schedule (24 epochs), the model initialized with the proposed pre-trained weights achieves results comparable to the MP3D model trained from scratch with the 6x learning schedule (72 epochs). These results validate the effectiveness of the proposed pre-training scheme.

Performance on Varying Dataset Sizes: In medical image analysis, annotated data is often scarce, so it is appealing to better understand the effects of pre-trained weights when the dataset is small. In this subsection, we compare 2x and 6x training from scratch against 2x training with pre-training on varying dataset sizes, randomly choosing 20%, 40%, 60%, and 80% of the training data (Table 5). Pre-training-based models achieve better performance with less training time in all cases, and the smaller the dataset, the larger the gap. Performance starts to drop dramatically when training with only 40% of the data. When training with only 20% of the dataset (around 4,500 images), the model trained with the proposed pre-trained weights achieves an absolute gain of 6.57% in MAP@0.5, an 11% relative gain.

4 Conclusions

In this paper, we propose a generic model architecture that exploits a 3D network for 2D lesion detection in CT slices. The proposed MP3D FPN reduces computation and memory cost while providing enhanced 3D context modeling ability. A simple yet effective 3D network pre-training method is also derived to facilitate efficient training. Without sophisticated structures or multiple supervision signals, our model significantly improves detection performance on the DeepLesion dataset, surpassing all SOTA methods. We have demonstrated the benefits of pre-trained weights across dataset sizes, and we expect the MP3D ResNet, along with its pre-trained weights, to serve as a benchmark backbone for 3D medical image analysis, contributing to accessible precision medicine.