
1 Introduction

As reported in the global cancer statistics for 2012, nearly 1.83 million new cases of lung cancer occurred, with a fatality rate of up to 82% [26]. Because early lung cancer has no obvious symptoms, it is often diagnosed only at the middle or late stages, which leads to high treatment costs and poor outcomes. Early detection and diagnosis of lung cancer are therefore particularly important. In particular, lung nodule detection at the initial stage deserves attention, because it gives patients the best chance of recovery and survival. The National Lung Screening Trial has shown that screening with low-dose computed tomography (CT) reduces overall mortality among individuals at high risk for developing lung cancer [25]. As a result, millions of CT images need to be analysed by radiologists, which consumes considerable time and effort.

Fig. 1.

Examples of pulmonary nodules (right, green rectangle) and false positives (left, red rectangle). The right part shows nodules of various sizes, shapes, locations and types. False positive candidates, which are unrelated to cancer, have a morphological appearance quite similar to that of true nodules. (Color figure online)

To speed up the reading process and reduce the burden on radiologists, computer-aided diagnosis (CAD) systems have become an active field in medical image processing. The predictions of a CAD system serve as a second opinion before the final decision is made. In the diagnosis of lung cancer, automated pulmonary nodule detection in CT scans plays an important role in CAD systems [12, 21, 28]. Although the benefits of CAD systems have been demonstrated, automated detection remains challenging because a substantial fraction of nodules is missed when the false positive rate is kept low. Fig. 1 shows pulmonary nodules of various sizes, shapes, locations and types, together with false positive candidates of similar morphology. As the figure shows, false positive candidates that closely resemble nodules make the task particularly difficult. Recently, the success of convolutional neural networks (CNNs) has drawn many academic groups and companies to apply the learning power of deep learning to CAD. In this paper, we propose a novel automated pulmonary nodule detection and classification framework to assist the CT reading process. The main contributions of this paper are as follows:

  • To adapt Faster R-CNN to the detection task, we extend it with two region proposal networks and a deconvolutional layer. To integrate more spatial information from the lung scans, three detection networks are trained on three kinds of adjacent slices, respectively, and their results are then fused.

  • For false positive reduction, a set of 2D patches from multiple view planes is used to train three 2D CNNs. To boost sensitivity, we keep the samples misclassified by each model and use them to train the next model. Finally, the classification results are combined by voting.

The performance of our system is validated on the LUNA16 dataset [1]. For nodule candidate detection, the sensitivity reaches 86.42%. For false positive reduction, the sensitivities reach 73.4% and 74.4% at 1/8 and 1/4 FPs/scan, respectively. These figures show that our system is promising for automated pulmonary nodule detection and classification.

2 Related Work

Current automated pulmonary nodule detection systems mainly consist of two stages: (1) nodule candidate detection [6, 8]; (2) false positive reduction [19]. In the nodule candidate detection stage, candidates are screened using hand-crafted features and techniques such as morphological features, voxel clustering and pixel thresholding [29]. Some researchers instead use convolutional neural networks to describe the characteristics of nodules and obtain the candidate regions. In the false positive reduction stage, a classifier is designed to eliminate the large number of false positive candidates.

Recently, CAD systems based on deep convolutional neural networks have been designed for automatic lung cancer detection [14]. For candidate detection, the ZNET system generates candidates from the probability map produced by a U-Net [17] applied to each axial slice. Then 64 \(\times \) 64 patches from the axial, sagittal and coronal views are extracted for each candidate. Each patch is processed separately by a wide residual network, and the network's predictions for the three patches are averaged to obtain the final prediction. Traverso et al. [27] propose a web- and cloud-based CAD system that combines two independent CAD sub-systems: the Channeler Ant Model (lungCAM) and the Voxel-Based Neural Approach (VBNA). The two algorithms share a common starting point: the parenchymal volume obtained with a 3D region growing segmentation algorithm.

Because of the 3D nature of CT scans, some researchers propose 3D convolutional networks to address the problem. Dou et al. [15] propose a nodule detection framework based on 3D CNNs, which screens candidates with a fully convolutional network and retrieves high-probability locations as candidates. For false positive reduction, they employ a residual network, which eases gradient flow within the network. Hangzhou Jian Pei Science and Technology Co., Ltd. [2] proposes a multi-scale rule-based screening method to obtain nodule candidates; its false positive reduction uses a 3D CNN with wide channels, trained with data augmentation to prevent overfitting. Zhu et al. design a 3D Faster Region-based Convolutional Neural Network (R-CNN) for nodule detection, with 3D dual path blocks and a U-Net-like encoder-decoder structure to learn nodule features effectively. For nodule classification, they propose a gradient boosting machine (GBM) with 3D dual path network features [30].

Although most CAD systems based on 3D CNNs achieve encouraging performance, 3D CNNs require more training time, and their model size grows considerably compared with 2D CNNs. For example, the model size of an 11-layer 3D CNN, C3D [9], is 321 MB, which is even larger than that of a 152-layer 2D ResNet (ResNet152, 235 MB) [11], making it extremely difficult to train a very deep 3D CNN. In addition, CT scans often have different slice thicknesses (0.6–5 mm), which complicates the preprocessing of 3D lung CT volumes [13]. By contrast, individual 2D slices are largely unaffected by differences in slice thickness, and 2D models need less training time and fewer computational resources. Hence, 2D CNNs are an attractive and widely used choice for detecting lung nodules.
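The growth in model size can be made concrete with a parameter count. The sketch below uses illustrative layer widths (256 input and output channels, not taken from C3D or ResNet) and shows that replacing a k \(\times \) k kernel with a k \(\times \) k \(\times \) k kernel multiplies a layer's weights by roughly k; feature maps in a 3D network also gain a depth dimension, which raises activation memory as well.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int, dims: int) -> int:
    """Weights plus biases of one convolution layer with a square (2D) or cubic (3D) kernel."""
    return in_ch * out_ch * kernel ** dims + out_ch

# Illustrative layer: 256 -> 256 channels with a kernel of width 3
p2d = conv_params(256, 256, 3, dims=2)   # 3 x 3 kernel
p3d = conv_params(256, 256, 3, dims=3)   # 3 x 3 x 3 kernel
print(p2d, p3d, round(p3d / p2d, 2))     # 590080 1769728 3.0
```

Per layer the factor is about k (here 3); the total gap between a 2D and a 3D network then depends on the depth and widths chosen.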

3 Method

We propose an automated pulmonary nodule detection system consisting of two stages: nodule candidate detection and false positive reduction. Two 2D deep convolutional networks are designed for nodule detection and classification. Our candidate detection framework is based on the Faster Region-based Convolutional Neural Network (Faster R-CNN) [16], a well-known approach for object detection. The goal is to identify the locations of possible nodules at very high sensitivity, which means that some false positives will be generated. Therefore, we propose a boosting classifier based on 2D CNNs to reduce these false positive candidates. The whole framework is illustrated in Fig. 2 and described in detail below.

3.1 Nodule Candidate Detection

Inspired by the successful use of DCNNs in object recognition [16], we design a detection structure based on Faster R-CNN. The upper part of Fig. 2 presents the nodule candidate detection network. Given raw CT scans, the aim is to detect nodule candidates and to assign each location a probability of being a nodule. To make full use of the spatial information in the CT scans, we extract not only the middle slice of the nodule but also its two neighboring slices. Three 2D CNNs are trained separately on these three kinds of slices, and at test time their detection results are merged to obtain the nodule candidate regions. Each network is composed of three sub-networks: (1) a feature extraction network; (2) region proposal networks; (3) a Region-of-Interest classifier.
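The text does not state how the outputs of the three slice-specific detectors are merged. A minimal sketch, assuming the merge is a union of all detected boxes followed by greedy non-maximum suppression (NMS), is shown below; `fuse_detections`, the (x1, y1, x2, y2) box format and the IoU threshold of 0.3 are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def box_area(b):
    """Area of one box or an array of boxes in (x1, y1, x2, y2) format."""
    return (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])

def iou(box, boxes):
    """Intersection over union between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    return inter / (box_area(box) + box_area(boxes) - inter)

def fuse_detections(per_model, iou_thresh=0.3):
    """Merge (boxes[N, 4], scores[N]) tuples from several detectors by greedy NMS.

    Returns the kept boxes and scores, highest score first.
    """
    boxes = np.concatenate([b for b, _ in per_model]).astype(float)
    scores = np.concatenate([s for _, s in per_model]).astype(float)
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Discard boxes that overlap the current best box too much
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return boxes[keep], scores[keep]
```

For example, two near-identical boxes from the middle-slice and neighbor-slice detectors collapse into the higher-scoring one, while a distant box from the third detector survives. Averaging overlapping boxes instead of suppressing them would be an equally plausible fusion rule.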

Fig. 2.

The architecture of our CAD system, which consists of two parts. The top part is the nodule detection network, composed of three sub-networks. The basic feature extraction network is VGG16, and a deconvolution layer is used to enlarge the feature map. Two region proposal networks with seven designed anchors are then applied to obtain the proposals. The sizes of the seven anchors are 12 \(\times \) 12, 18 \(\times \) 18, 27 \(\times \) 27, 36 \(\times \) 36, 51 \(\times \) 51, 75 \(\times \) 75 and 120 \(\times \) 120. Finally, a Region-of-Interest classifier is employed to obtain the candidates. The bottom part performs false positive reduction on the nodule candidates. First, different data preprocessing methods are used for positive and negative samples. Then, in the boosting classifier, the training data of each model come from two sources: a training subset and the data misclassified by the previous model. Finally, the results of the three classification CNNs are combined by voting.

First, the feature extraction network is VGG16 [20] with five groups of convolutions, which is shared by the subsequent sub-networks. Because lung nodules are much smaller than common objects in natural images, the choice of receptive field is important for pulmonary nodule candidate detection. After a series of convolution and pooling layers, the receptive field grows while the feature map shrinks. The feature map of the last convolution layer of VGG16 (conv5_3) is 38 \(\times \) 38, which limits performance in detecting nodule RoIs, because such a small feature map cannot clearly represent the features of small objects. Inspired by [6], we therefore add a deconvolution layer after conv5_3 to obtain a 148 \(\times \) 148 feature map on which region proposals are made.
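The feature-map sizes above can be reproduced with a short calculation. The sketch assumes a 600 \(\times \) 600 input (a common Faster R-CNN scale; the text does not state it) and Caffe-style ceil rounding in VGG16's 2 \(\times \) 2 pooling layers. A transposed convolution maps a size n to (n − 1)·stride − 2·pad + kernel, so a deconvolution with kernel 4, stride 4 and padding 2, which are hypothetical settings chosen only because they reproduce the stated size, turns 38 into 148.

```python
import math

def pooled_size(size: int, num_pools: int) -> int:
    """Spatial size after `num_pools` 2x2, stride-2 max-pools with ceil rounding."""
    for _ in range(num_pools):
        size = math.ceil(size / 2)
    return size

def deconv_out(size: int, kernel: int, stride: int, pad: int) -> int:
    """Output size of a transposed convolution (no output padding, no dilation)."""
    return (size - 1) * stride - 2 * pad + kernel

INPUT = 600                        # assumed input side length (not given in the text)
conv3_3 = pooled_size(INPUT, 2)    # conv3_3 follows two pooling layers in VGG16
conv5_3 = pooled_size(INPUT, 4)    # conv5_3 follows four pooling layers
up = deconv_out(conv5_3, kernel=4, stride=4, pad=2)  # hypothetical deconvolution
print(conv3_3, conv5_3, up)        # 150 38 148
```

Under the same assumption, conv3_3 is 150 \(\times \) 150, close to the 148 \(\times \) 148 deconvolution output, which is consistent with attaching region proposal networks to both layers.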

Second, two region proposal networks are applied to integrate lower-layer information [10]. They are attached to the deconvolution layer and to the intermediate convolution layer conv3_3, respectively, whose feature maps have approximately the same size. Because the two region proposal networks operate on features from different depths of the network, they capture complementary information about the nodule.
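Both region proposal networks place the seven anchors listed in the Fig. 2 caption at every position of their feature maps. The sketch below generates these square anchors in image coordinates. `make_anchors` is a hypothetical helper, and the stride of 4 pixels for both the 148 \(\times \) 148 deconvolution map and a 150 \(\times \) 150 conv3_3 map is an assumption that follows from a 600-pixel input.

```python
import numpy as np

ANCHOR_SIZES = [12, 18, 27, 36, 51, 75, 120]  # square anchor sides from the text (pixels)

def make_anchors(feat_h, feat_w, stride, sizes=ANCHOR_SIZES):
    """Return (feat_h * feat_w * len(sizes), 4) boxes as (x1, y1, x2, y2) in image pixels."""
    ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    # Centre of each feature-map cell projected back onto the image
    cx = (xs.ravel() + 0.5) * stride
    cy = (ys.ravel() + 0.5) * stride
    half = np.asarray(sizes, dtype=float) / 2.0
    # Broadcasting gives every location every anchor size
    x1 = cx[:, None] - half[None, :]
    y1 = cy[:, None] - half[None, :]
    x2 = cx[:, None] + half[None, :]
    y2 = cy[:, None] + half[None, :]
    return np.stack([x1, y1, x2, y2], axis=-1).reshape(-1, 4)

deconv_anchors = make_anchors(148, 148, stride=4)   # RPN on the deconvolution layer
conv3_anchors = make_anchors(150, 150, stride=4)    # RPN on conv3_3
```

Under these assumptions the two networks score 148²·7 + 150²·7 = 310,828 anchors per image before any filtering, which is why each RPN samples only a mini-batch of anchors for training.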

Third, to fit the size of nodules, seven anchors with different sizes are designed: 12 \(\times \) 12, 18 \(\times \) 18, 27 \(\times \) 27, 36 \(\times \) 36, 51 \(\times \) 51, 75 \(\times \) 75 and 120 \(\times \) 120. A 3 \(\times \) 3 sliding window is applied to the two feature maps (the deconvolution layer and conv3_3), and the designed anchors are used to predict multiple RoIs at each sliding-window location. Each region proposal network maps every 3 \(\times \) 3 window to a 512-d feature, which is fed into two sibling fully-connected layers that simultaneously regress the bounding box and predict the score. With these definitions, the multi-task loss for an image is defined as:

$$\begin{aligned} L(p_i,t_i,p_{kj}^{1},t_{kj}^1)=\sum _{i}L_1(p_i,t_i)+\sum _{k=1}^{2}\sum _{j}L_2(p_{kj}^1,t_{kj}^1), \end{aligned}$$
(1)

Here, \(L_1\) and \(L_2\) can be written as:

$$\begin{aligned} L_{1}(p_i,t_i)=L_{cls}(p_i,p_{i}^{*})+\lambda p_{i}^{*}L_{reg}(t_i,t_{i}^{*}). \end{aligned}$$
(2)
$$\begin{aligned} L_{2}(p_{kj}^1,t_{kj}^1)=\frac{1}{N_{cls}}L_{cls}(p_{kj}^1,p_{kj}^{*})+\lambda \frac{1}{N_{reg}}p_{kj}^{*}L_{reg}(t_{kj}^1,t_{kj}^{*}). \end{aligned}$$
(3)

where i is the index of a proposal produced by the region proposal networks and \(p_i\) is the predicted probability of proposal i being a nodule. The ground-truth label \(p_i^*\) is 1 if the proposal is positive and 0 otherwise. \(t_i\) is a vector of the 4 parameterized coordinates of the predicted bounding box, and \(t_i^*\) is that of the ground-truth box associated with a positive proposal. The classification loss \(L_{cls}\) is the log loss over two classes (nodule vs. not nodule). j is the index of an anchor chosen as a training sample in a region proposal network training mini-batch, and k is the index of the two region proposal networks; \(p_{kj}^1\) and \(t_{kj}^1\) are defined like the symbols above but for the k-th region proposal network. \(N_{cls}\) and \(N_{reg}\) are normalization terms, and the parameter \(\lambda \) controls the balance between \(L_{cls}\) and \(L_{reg}\); \(\lambda \) is set to 1 in all of our experiments.
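To make Eqs. (1)–(3) concrete, the NumPy sketch below evaluates the loss for given predictions. It assumes that \(L_{reg}\) is the smooth L1 loss of the original Faster R-CNN (the text does not name it) and leaves \(N_{cls}\) and \(N_{reg}\) as arguments, because their values are not given; \(\lambda = 1\) as in the experiments. All function names are illustrative.

```python
import numpy as np

def log_loss(p, label, eps=1e-7):
    """Binary log loss L_cls (nodule vs. not nodule), element-wise."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))

def smooth_l1(t, t_star):
    """Assumed L_reg: smooth L1 summed over the 4 box parameters."""
    d = np.abs(t - t_star)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum(axis=-1)

def roi_loss(p, p_star, t, t_star, lam=1.0):
    """First term of Eq. (1): Eq. (2) summed over proposals i."""
    return np.sum(log_loss(p, p_star) + lam * p_star * smooth_l1(t, t_star))

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0):
    """Eq. (3) summed over the sampled anchors j of one region proposal network."""
    return (np.sum(log_loss(p, p_star)) / n_cls
            + lam * np.sum(p_star * smooth_l1(t, t_star)) / n_reg)

def total_loss(roi_batch, rpn_batches, n_cls, n_reg, lam=1.0):
    """Eq. (1): RoI-classifier loss plus the losses of both RPNs (k = 1, 2).

    Each batch is a tuple (p, p_star, t, t_star) of NumPy arrays.
    """
    loss = roi_loss(*roi_batch, lam=lam)
    for batch in rpn_batches:
        loss += rpn_loss(*batch, n_cls=n_cls, n_reg=n_reg, lam=lam)
    return float(loss)
```

The factor \(p^*\) switches the regression term off for negative samples, so only proposals and anchors matched to a nodule contribute to box regression.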

3.2 False Positive Reduction

For the obtained candidate nodules, isotropic resampling is performed before extracting 2D patches: because scans from different medical devices have different voxel sizes and resolutions, all candidates are resampled to 1 \(\times \) 1 \(\times \) 1 mm voxels. Similar to [19], nine patches on multi-view planes are extracted, as shown in the nodule augmentation part of Fig. 2. Three of these planes are the sagittal, coronal and axial planes; each of the remaining six cuts two opposite faces of the cube along their diagonals. This provides more context information about each nodule and alleviates the imbalance between positive and negative samples. For non-nodules, by contrast, we only perform downsampling with a 2D CNN, as detailed in Sect. 4.1.
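The nine views can be sketched as follows for a cube that has already been resampled to 1 mm voxels. The (z, y, x) axis order and the helper name are assumptions; the 35-voxel side in the test matches the 35 \(\times \) 35 network input mentioned below. The diagonal planes are sampled at voxel indices, so their in-plane spacing along the diagonal is √2 mm rather than being resampled to 1 mm.

```python
import numpy as np

def nine_view_patches(cube):
    """Extract the 9 symmetry planes of a cube centred on a candidate.

    cube: (n, n, n) array ordered (z, y, x). Returns an array of shape (9, n, n).
    """
    n = cube.shape[0]
    if cube.shape != (n, n, n):
        raise ValueError("expected a cubic volume")
    c = n // 2
    i = np.arange(n)
    r = i[::-1]
    views = [
        cube[c, :, :],                  # axial
        cube[:, c, :],                  # coronal
        cube[:, :, c],                  # sagittal
        # Six diagonal planes, each containing one axis and cutting two
        # opposite faces of the cube along their diagonals
        cube[:, i, i], cube[:, i, r],   # contain the z axis
        cube[i, :, i], cube[i, :, r],   # contain the y axis
        cube[i, i, :], cube[i, r, :],   # contain the x axis
    ]
    return np.stack(views)
```

Each positive candidate therefore yields nine 2D training patches instead of one, which is how the multi-view extraction also eases the class imbalance.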

Boosting [18] is a widely used and effective statistical learning method. Following this idea, we train several CNNs and obtain the final result by voting. We divide the dataset into 5 subsets: 3 for training, 1 for validation and 1 for testing. Each of the three training subsets is then used to train one classification model. In our experiments, the input of the network is a 35 \(\times \) 35 patch. We adapt and fine-tune AlexNet [14] based on a pilot study on a smaller dataset, and refer readers to [3] for further details of the network architecture. The first subset is used to train a weak classifier, model1; the samples misclassified by model1, together with the second subset, are then used to train a new model2 from scratch. Similarly, model3 is trained with the samples misclassified by model1 and model2 together with the third subset. Because the misclassified samples of each round become training data for the next model, the later networks concentrate on samples that are hard to classify. This suggests that hard example mining is important for correctly discriminating hard mimics and hence for improving the accuracy of the CNNs.
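The data flow of this boosting-style cascade and the final vote can be sketched as follows. To keep the example self-contained, each 2D CNN is replaced by a hypothetical nearest-centroid classifier on feature vectors, and we assume that a model's misclassified samples are collected on its own training data, a detail the text leaves open. Only the scheme (subset k plus earlier mistakes trains model k, followed by a majority vote) mirrors the description.

```python
import numpy as np

class NearestCentroid:
    """Hypothetical stand-in for one 2D CNN: predicts the class with the closer mean feature."""

    def fit(self, X, y):
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
        return self

    def predict(self, X):
        dist = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=-1)
        return dist.argmin(axis=1)

def train_cascade(parts, make_model=NearestCentroid):
    """Train one model per subset; each later model also sees earlier models' mistakes."""
    models, hard_X, hard_y = [], [], []
    for X_part, y_part in parts:
        X = np.concatenate([X_part, *hard_X])
        y = np.concatenate([y_part, *hard_y])
        model = make_model().fit(X, y)        # trained from scratch
        wrong = model.predict(X) != y         # samples this model misclassifies
        hard_X.append(X[wrong])
        hard_y.append(y[wrong])
        models.append(model)
    return models

def majority_vote(models, X):
    """Label 1 (nodule) when more than half of the models say so."""
    votes = np.stack([m.predict(X) for m in models])   # (num_models, num_samples)
    return (2 * votes.sum(axis=0) > len(models)).astype(int)
```

With three models, a candidate is kept as a nodule only when at least two of them agree, so a single model's error on a hard mimic can be outvoted.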

4 Experiments and Results

4.1 Dataset

The LUNA16 dataset is derived from LIDC-IDRI [5], the largest publicly available reference database for pulmonary nodules, which contains a total of 1018 CT scans. LUNA16 excludes scans with slice thickness greater than 2.5 mm, inconsistent slice spacing or missing slices, and also discards annotated nodules smaller than 3 mm. The remaining 888 scans are divided into ten folds for cross validation. LUNA16 provides 754,976 candidates in total, each with a class label (0 for non-nodule and 1 for nodule). Note that there can be multiple candidates per nodule. The false positive reduction stage faces a serious class imbalance in this dataset: false positive candidates outnumber true nodules by approximately 500:1. In the data preprocessing phase, we therefore randomly choose equal numbers of positive and negative samples to train a small CNN; only the negative samples misjudged by this CNN are kept, which downsamples the false positives. For true nodules, we apply image translations and horizontal reflections similar to [14], and we also extract patches of the same nodule in different orientations.
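The two preprocessing steps for the imbalanced data can be sketched as follows: translations and horizontal reflections multiply each positive patch, while negatives are reduced to those that a small screening classifier mislabels as nodules. The shift range `max_shift` and the `predict` callable are hypothetical; translations are made here by edge padding and cropping, one of several reasonable choices.

```python
import numpy as np

def augment_positive(patch, max_shift=2):
    """Translated and horizontally reflected copies of one 2D nodule patch.

    max_shift (pixels) is a hypothetical value; the text does not give the range.
    """
    h, w = patch.shape
    s = max_shift
    padded = np.pad(patch, s, mode="edge")
    out = []
    for dy in range(-s, s + 1):
        for dx in range(-s, s + 1):
            crop = padded[s + dy : s + dy + h, s + dx : s + dx + w]
            out.append(crop)
            out.append(np.fliplr(crop))   # horizontal reflection
    return np.stack(out)

def keep_hard_negatives(predict, negatives):
    """Downsample negatives: keep only those a screening classifier labels as nodules (1)."""
    preds = np.asarray(predict(negatives))
    return negatives[preds == 1]
```

With `max_shift=2`, each positive patch yields (2·2 + 1)² · 2 = 50 patches, while the negative set shrinks to the hard mimics, so the two steps attack the 500:1 imbalance from both sides.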

4.2 Evaluations Metrics

For the binary classification problem in the false positive reduction stage, we use the area under the ROC curve (AUC) to measure the performance of the ConvNets. In addition, we adopt the same evaluation metrics as the LUNA16 challenge: performance is evaluated on the cross-validation results using the Free-Response Receiver Operating Characteristic (FROC) curve and the competition performance metric (CPM). Sensitivities are measured at 1/8, 1/4, 1/2, 1, 2, 4 and 8 FPs per scan, and the CPM is the average of the sensitivities at these seven points.
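The CPM can be computed from scored candidates as sketched below. This is a simplified illustration: it assumes candidates have already been matched to nodules (-1 marks a false positive) and reads a step-wise FROC, whereas the official LUNA16 evaluation script performs the matching and interpolates the curve itself, so its numbers may differ slightly. Tied scores are split in sort order, which a real threshold sweep could not do.

```python
import numpy as np

FP_RATES = np.array([0.125, 0.25, 0.5, 1, 2, 4, 8])  # FPs per scan used by the CPM

def froc_curve(scores, nodule_ids, num_scans, num_nodules):
    """Step-wise FROC from scored candidates.

    nodule_ids[i] is the id of the true nodule hit by candidate i, or -1 for a
    false positive; several candidates may hit the same nodule (counted once).
    Returns (FPs per scan, sensitivity) after each successive threshold.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    detected, fp = set(), 0
    fps, sens = [0.0], [0.0]
    for i in order:
        if nodule_ids[i] < 0:
            fp += 1
        else:
            detected.add(nodule_ids[i])
        fps.append(fp / num_scans)
        sens.append(len(detected) / num_nodules)
    return np.array(fps), np.array(sens)

def cpm(fps, sens, rates=FP_RATES):
    """Average of the best sensitivity reachable at each FP-per-scan budget."""
    return float(np.mean([sens[fps <= r].max() for r in rates]))
```

For instance, with one scan, two nodules and candidate scores 0.9 (nodule 0), 0.8 (false positive), 0.7 (nodule 1) and 0.6 (false positive), the sensitivity is 0.5 at the three lowest budgets and 1.0 from 1 FP/scan on, giving a CPM of 5.5/7.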

4.3 Results

To train the nodule candidate detector, we perform 10-fold cross validation on the LUNA16 dataset using the given patient-level split. Training runs for 100,000 iterations of stochastic gradient descent with momentum 0.9, weight decay 0.0005 and a base learning rate of 0.001. The FROC curve of nodule candidate detection is shown in Fig. 3. The solid line is the interpolated FROC based on the predictions, and the dashed lines are the upper and lower bounds of the bootstrapped FROC. The sensitivity of the nodule candidate detection method reaches 86.42%, with an average of 4.67 candidates per scan. Table 1 lists the sensitivities at 1/8, 1/4, 1/2, 1, 2, 4 and 8 FPs per scan, together with the CPM scores of several other methods [4]. As can be seen, our results are competitive with other traditional methods. Although methods based on 3D CNNs outperform those based on 2D CNNs, they require considerably more computational resources and time. Notably, our method outperforms the other teams that use variants of 2D CNNs.
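For reference, the stochastic gradient descent update implied by these hyperparameters is sketched below. The velocity convention (the one used by Caffe's SGD solver) is an assumption, since the text does not name the framework; other common conventions differ only in where the learning rate is applied and are equivalent for a constant learning rate.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay.

    Defaults are the hyperparameters reported for the candidate detector;
    the update convention is an assumption.
    """
    g = grad + weight_decay * w    # weight decay acts as the gradient of an L2 penalty
    v = momentum * v + lr * g      # accumulate velocity
    return w - v, v
```

Even with a zero data gradient, weight decay shrinks every weight slightly at each step, which is how it regularizes the detector over the 100,000 iterations.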

Table 1. Results of the nodule candidate detection.
Table 2. Results of the false positive reduction track of the LUNA16 challenge.
Fig. 3.

Sensitivity (recall) of the nodule detection network versus false positives per scan. The CPM score (average recall at 0.125, 0.25, 0.5, 1, 2, 4 and 8 false positives per scan) is 77.5%. The proposed nodule candidate detection achieves a recall of 86.4% over all nodules. The dashed lines are the lower and upper bounds of the 95% confidence interval of the FROC, obtained by bootstrapping with 1,000 bootstraps [4]. The solid line is the interpolated FROC based on the predictions.

Fig. 4.

FROC curve of the boosting 2D CNN CAD system on LUNA16. The solid line is the interpolated FROC based on the predictions, and the dashed lines are the upper and lower bounds of the bootstrapped FROC.

For false positive reduction, we perform 5-fold cross-validation on the 888 selected LIDC-IDRI cases. The pixel intensity range (−1000, 4000 Hounsfield Units) is rescaled to (0, 1), and the mean gray-scale value is subtracted to normalize the distribution of the training and testing data. The mini-batch size is set to 256, the momentum [23] to 0.9 and the dropout [22] rate to 0.5; dropout is applied in the convolutional and fully connected layers as regularization. The FROC curve of our 2D CNN CAD system is presented in Fig. 4. The AUC reaches 0.954 when our 2D CNN is applied to the nodule/non-nodule classification of the candidates, and the average sensitivity over the seven operating points is 0.790. Seven teams participated in the ISBI challenge; their sensitivities at different false positive rates are listed in Table 2. Among the 2D approaches, our system achieves good results: for example, sensitivities of 0.734 and 0.744 are obtained at 1/8 and 1/4 FPs/scan, respectively. This shows that our system maintains a satisfactory sensitivity even at extremely low false positive rates.
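The intensity normalization described above can be sketched as follows. Clipping values outside (−1000, 4000) HU before rescaling is our assumption, and computing the mean on the training patches only, then subtracting it from both sets, is one reasonable reading of the text.

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 4000.0   # intensity range stated in the text

def rescale_hu(image):
    """Clip to [HU_MIN, HU_MAX] (clipping is an assumption) and map linearly to [0, 1]."""
    x = np.clip(np.asarray(image, dtype=np.float32), HU_MIN, HU_MAX)
    return (x - HU_MIN) / (HU_MAX - HU_MIN)

def mean_center(train_patches, test_patches):
    """Subtract the mean gray value of the training patches from both sets."""
    mean = float(np.mean(train_patches))
    return train_patches - mean, test_patches - mean
```

Using the training mean for the test data as well keeps the test set from influencing the preprocessing.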

5 Conclusion and Discussion

In this paper, we propose a novel and effective 2D CNN-based system for detecting pulmonary nodules in CT scans. For the detection stage, we improve the architecture of Faster R-CNN to detect small pulmonary nodules. A boosting-based classifier is then trained to reduce the false positive candidates produced by the first stage. Experiments on LUNA16 demonstrate that our system can accurately detect latent pulmonary nodules. In future work, we will further improve the architecture of our system and evaluate it on more medical images with the help of radiologists and surgeons.