
1 Introduction

Lung cancer is notoriously aggressive with a low long-term survival rate [1]. Quantitative analysis of lung nodules using thoracic Computed Tomography (CT) has been a central focus for early cancer diagnosis, where the CT phenotype provides a powerful tool to comprehensively capture nodule characteristics [2].

Diagnostically classifying malignant and benign nodules from CT images helps radiologists with nodule staging assessment and individual therapeutic planning. Although various approaches have been proposed for nodule analysis, such as parametric texture feature extraction [4, 8, 13], finding well-suited parameters for robust analysis remains problematic. Technical challenges undoubtedly remain in defining and extracting quantitative features from clinical images to improve image-guided disease diagnosis. Furthermore, prior studies have mostly focused on nodule morphology [5, 19], which may not provide an accurate description of the nodule. For example, the definition of nodule boundaries is obscure and subjective; inter-reader variability among radiologists makes precise nodule delineation a challenging task. In view of these challenges, two specific questions arise: (a) How can discriminative features be learned from heterogeneous nodule data to represent different diagnostic groups? (b) How can one design a robust framework that extracts quantitative features from original nodule patches, rather than segmented nodules, thereby eliminating onerous preprocessing steps such as nodule segmentation?

Fig. 1. An overview of the MCNN. Our approach first extracts multiple nodule patches to capture the wide range of nodule variability from input CT images. The obtained patches are then fed into the networks simultaneously to compute discriminative features. Finally, our approach applies a classifier to label the input nodule malignancy.

In this paper, we study the problem of lung nodule diagnostic classification based on thoracic CT scans. In contrast to current methods that rely primarily on nodule segmentation and textural feature descriptors for the classification task, we propose a hierarchical learning framework that captures nodule heterogeneity by utilizing Convolutional Neural Networks (CNN) to extract features (as illustrated in Fig. 1). The learned features can be readily combined with state-of-the-art classifiers (e.g., Support Vector Machine (SVM) and Random Forest (RF)) for related Computer-Aided Diagnoses (CADs). Our method achieves 86.84 % accuracy on nodule classification using only nodule patches. We also observe that the proposed method is robust against noise corruption: the classification performance remains stable across different levels of input noise, indicating good generalization.

Contributions. We introduce a Multi-scale Convolutional Neural Networks (MCNN) model to tackle lung nodule diagnostic classification without delineation of nodule morphology, and explore a hierarchical representation learned from raw patches for lung nodule classification. Our methodological contribution is three-fold:

  • Our MCNN take multi-scale raw nodule patches, rather than segmented regions, as input, providing evidence that the information gained from raw nodule patches is valuable for lung nodule diagnosis.

  • Our MCNN remove the need for hand-crafted feature engineering, such as nodule texture, shape compactness, and nodule sphericity. The MCNN can automatically learn the discriminative features.

  • Although it is challenging to deal directly with noisy nodule CT data, we show that the proposed MCNN model is effective in capturing nodule characteristics for diagnostic classification even under high levels of noise corruption.

Related Work. Image-based lung nodule analysis is normally performed through nodule segmentation [5], feature extraction [2], and labelling of nodule categories [8, 17, 19]. Way et al. [19] first segmented nodules and then extracted texture features to train a linear discriminant classifier. El-Baz et al. [5] used shape analysis for diagnosing malignant lung nodules. Han et al. [8] used 3-D texture feature analysis for the diagnosis of pulmonary nodules, taking extended neighbouring structures into account. However, all of these methods rely on nodule segmentation as a prerequisite for feature extraction. Notably, automated nodule segmentation can affect classification because segmentation results usually depend on initialization, as in region-growing and level-set methods. Working on such segmented regions may yield inaccurate features that lead to erroneous outputs.

Descriptors such as Histogram of Oriented Gradients (HOG) [4] and Local Binary Patterns (LBP) [13] are widely used for feature representation in medical image analysis. However, they are known to be domain agnostic [15]: their required hyper-parameters make these approaches sensitive to the specific task. For example, repetitious parameter tuning is needed for the number of neighbourhood points in LBP and the size of the cell window in HOG.

Our work is conceptually similar to the massive training artificial neural network [17], which demonstrated the feasibility of learning from artificial neural networks. However, that work was an integrated classifier requiring extra support from a 2-D Gaussian distribution for decision-making, where an image-to-image mapping based on local pixels was learned. Our approach, without assuming any extra distributions, extracts features globally from the original nodule image space through stacked convolutional operations and max-pooling selections. In contrast to [17], our work is more computationally efficient, reducing the feature dimensionality and yielding highly discriminative features from hierarchical layers.

2 Learning Multi-scale Convolutional Neural Networks

Given a lung nodule CT image, our goal is to discover a set of globally discriminative features using the proposed MCNN model, capturing the essence of class-specific nodule information. The challenge is that the image space is extremely heterogeneous, since both healthy tissues and nodules are included. In this work, we make full use of CNN to learn discriminative features, and build three CNN in parallel to extract multi-scale features from nodules of different sizes. Details are given in this section.

2.1 Convolutional Neural Networks Architecture

Our Convolutional Neural Networks contain two convolutional layers, each followed by a max-pooling layer, and a fully connected layer that produces the final output features. The detailed structure of the network is shown in Fig. 2. From the input nodule patch to the final feature layer, the sizes of the feature maps keep decreasing, which helps remove potentially redundant information in the original nodule patch and obtain discriminative features for nodule classification.

Fig. 2. The structure of the Convolutional Neural Networks learned in our work. The numbers along each side of the cuboid indicate the dimensions of the feature maps. The inside cuboid represents the 3D convolution kernel and the inside square stands for the 2D pooling region. The number of hidden neurons in the final feature layer is marked aside.

The network starts from a convolutional layer, which convolves the input feature map with a number of convolutional kernels and yields a corresponding number of output feature maps. Formally, the convolution operation between an input feature map f and a convolutional kernel h is defined by:

$$\begin{aligned} y=\max \left( 0, \sum _c f_c*h_c+b \right), \end{aligned}$$
(1)

where \(f_c\) and \(h_c\) denote the cth slice of the feature map and of the convolutional kernel respectively, and b is a scalar bias. \(*\) denotes the convolution operation. Both h and b are learned continuously during training. To perform a non-linear transformation from the input to the output space, we adopt the rectified linear unit (ReLU) non-linearity in Eq. 1 for each convolution [11]; it is expressed as \(y=\max (0, x)\), where x is the convolution output.
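To make Eq. 1 concrete, the following is a minimal NumPy sketch of one convolution-plus-ReLU unit, assuming a multi-channel 2-D input and a "valid" convolution window; `conv_relu` and the toy shapes are illustrative, not the CAFFE implementation used in this work.

```python
import numpy as np
from scipy.signal import correlate  # cross-correlation, as CNN layers typically implement "convolution"

def conv_relu(f, h, b):
    """Eq. (1): sum the per-slice convolutions f_c * h_c, add bias, apply ReLU.

    f : input feature map, shape (C, H, W)
    h : convolutional kernel, shape (C, k, k)
    b : scalar bias
    """
    # Sum over channels c, then add the scalar bias b.
    y = sum(correlate(f[c], h[c], mode="valid") for c in range(f.shape[0])) + b
    return np.maximum(0.0, y)  # ReLU non-linearity y = max(0, x)

# Toy usage: a 3-channel 8x8 map convolved with a 3x3 kernel yields a 6x6 output.
f = np.random.randn(3, 8, 8)
h = np.random.randn(3, 3, 3)
print(conv_relu(f, h, b=0.1).shape)  # (6, 6)
```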

Following the convolutional layer, a max-pooling layer is introduced to select feature subsets. It is formulated as

$$\begin{aligned} y_{(i,j)}=\max _{0\le m, n < s}\{x_{(i\cdot s+m,\,j\cdot s+n)}\} , \end{aligned}$$
(2)

where s is the pooling size and x denotes the output of the convolutional layer. An advantage of the max-pooling layer is its translation invariance, which is especially helpful when different nodule images are not well aligned.
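A minimal sketch of Eq. 2 for a single 2-D feature map follows, assuming non-overlapping pooling windows and spatial dimensions divisible by s; the reshape trick is purely illustrative.

```python
import numpy as np

def max_pool(x, s):
    """Eq. (2): non-overlapping s x s max-pooling over a 2-D map x.

    Assumes both spatial dimensions of x are divisible by s.
    """
    H, W = x.shape
    # Partition x into (H//s, W//s) blocks of size s x s and take each block's max.
    return x.reshape(H // s, s, W // s, s).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(x, 2))  # [[ 5.  7.] [13. 15.]]
```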

Fig. 3. Nodule slice examples from a benign nodule patch (a) and a malignant nodule patch (b). The scales are \(96 \times 96 \times 96\), \(64 \times 64 \times 64\), and \(32 \times 32 \times 32\) in pixels, respectively.

2.2 Multi-scale Nodule Representation

Our multi-scale sampling strategy is motivated by the clinical fact that nodule sizes vary remarkably, ranging from less than 3 mm to more than 30 mm in the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) datasets [3]. In the proposed MCNN architecture, three CNN that take nodule patches at different scales (as shown in Fig. 3) as inputs are assembled in parallel. We briefly refer to the three CNN as \( CNN _{ 0 }\), \( CNN _{ 1 }\), and \( CNN _{ 2 }\). To reduce the number of parameters of the MCNN, we follow the setting in [6] and share parameters among all the CNN. The resulting output of our MCNN is the concatenation of the three CNN outputs, forming the final discriminative feature vector, which is fed directly to the final classifier without any feature reduction. We also follow the idea of deeply supervised networks (DSN) [12] to construct our objective function. Unlike the traditional objective function in CNN, DSN introduces "companion objectives" [12] into the final objective function to alleviate the vanishing-gradient problem, so that training is fast and stable. The entire objective function is thus represented as

$$\begin{aligned} F(W)=P(W) + Q(W) . \end{aligned}$$
(3)

In our work, \(P(W)={LOSS}(W, w^{(out)})\) is the overall hinge loss for the concatenated feature layer, and \(Q(W)=\sum _{m=1}^{M}\alpha _m\, loss(W,w^{(m)})\) is the sum of the companion hinge losses from all CNN, where \(\alpha _m\) is the coefficient for the mth CNN. W denotes the combined weights of all the CNN, while \(w^{(m)}\) and \(w^{(out)}\) are the weights of the feature layer of the mth CNN and of the final concatenated feature layer, respectively. In this way, F(W) keeps each individual network optimized while making the assembled output coherent. Figure 4 shows the concatenated features projected into a 2-D subspace; it shows that the proposed MCNN model is able to remove redundant information in the original images and extract discriminative features.
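The combined objective of Eq. 3 can be sketched as follows, assuming the classifier scores of the final layer and of each per-scale branch are already computed; `hinge_loss`, `total_objective`, and the toy inputs are hypothetical names, with \(\alpha _m = 0.001\) as used later in Sect. 3.2.

```python
import numpy as np

def hinge_loss(scores, labels):
    """Binary hinge loss; labels in {-1, +1}, scores are real-valued margins."""
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))

def total_objective(final_scores, branch_scores, labels, alpha=0.001):
    """Eq. (3): F(W) = P(W) + Q(W).

    P(W): hinge loss on the concatenated-feature output.
    Q(W): sum of companion hinge losses over the M per-scale CNN branches,
          each weighted by alpha_m (0.001 for all m in our setting).
    """
    P = hinge_loss(final_scores, labels)
    Q = sum(alpha * hinge_loss(s, labels) for s in branch_scores)
    return P + Q

# Toy usage with M = 3 branches and 4 samples.
labels = np.array([1, -1, 1, -1])
final = np.array([0.8, -0.5, 1.2, -1.1])
branches = [np.random.randn(4) for _ in range(3)]
print(total_objective(final, branches, labels))
```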

Fig. 4. Feature visualization. The features learned by the MCNN from the training set (a) and the test set (b) are illustrated by projecting them into a 2-D subspace with principal component analysis (PCA) [10].

3 Experiments

3.1 Datasets and Setup

We evaluated our method on the LIDC-IDRI datasets [3], which consist of 1010 patients with lung cancer screening thoracic CT scans as well as mark-up annotated lesions. The nodules are rated from 1 to 5 by four experienced thoracic radiologists, indicating an increasing degree of malignancy (1 denotes low malignancy and 5 denotes high malignancy). In this study, we included nodules along with their annotated centers from the nodule report. Following [8], we used the average malignancy rating for each nodule: nodules with an average score lower than 3 were labelled benign, and those with an average score higher than 3 were labelled malignant. We removed nodules with ambiguous IDs and those with an average score of exactly 3. Overall, there were 880 benign nodules and 495 malignant nodules. Since image resolution varied, we resampled all images to a fixed resolution of 0.5 mm/pixel along all three axes, thereby removing the effect of resolution on classification performance. Each nodule patch was then cropped from the resampled CT image around the marked nodule center. The three input scales are \(32 \times 32 \times 32\), \(64 \times 64 \times 64\), and \(96 \times 96 \times 96\) pixels; all patches were resampled to \(32 \times 32 \times 32\) so that they could be uniformly fed into each CNN.
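The multi-scale patch extraction can be sketched as below, assuming the volume has already been resampled to 0.5 mm/pixel, the annotated center is given in voxel coordinates, and the nodule lies far enough from the volume border; `multi_scale_patches` and the use of `scipy.ndimage.zoom` are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np
from scipy.ndimage import zoom

def multi_scale_patches(volume, center, scales=(32, 64, 96), out=32):
    """Crop cubic patches at each scale around `center`, then resample to out^3.

    volume : 3-D CT array, already resampled to 0.5 mm/pixel.
    center : (z, y, x) annotated nodule center in voxel coordinates.
    Assumes the nodule is far enough from the volume border for all scales.
    """
    patches = []
    for s in scales:
        half = s // 2
        sl = tuple(slice(c - half, c - half + s) for c in center)
        patch = volume[sl]
        patches.append(zoom(patch, out / s))  # e.g. 96^3 -> 32^3
    return patches

# Toy usage on a synthetic volume.
vol = np.random.rand(200, 200, 200)
p32, p64, p96 = multi_scale_patches(vol, center=(100, 100, 100))
print(p32.shape, p64.shape, p96.shape)  # all (32, 32, 32)
```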

3.2 Implementation Details

We used 5-fold cross validation to evaluate classification performance based on features learned from the MCNN. In each round of validation, there were originally 1100 nodules (704 benign and 396 malignant) in the training set and 275 nodules (176 benign and 99 malignant) in the test set. To enlarge the training set for the MCNN, we augmented both benign and malignant nodules by translating the nodule patches along the three axes by \(\pm 2\) pixels, as in [16]; each patch was thus translated 6 times. This setting helps capture translation-invariant features. Note that the number of benign nodules is almost twice that of malignant nodules; to balance the dataset, all malignant nodules and only half of the benign nodules were augmented, resulting in 5588 ((396 + 704/2)\(\,\times \,\)6 + 396 + 704) multi-scale nodules in the training set. Since the three CNN share the same parameters, the equivalent number of total augmented nodules is 16,764. The test set retained its original 275 nodule samples in each validation round.
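A minimal sketch of the translation augmentation follows, producing the 6 shifted copies per patch; `np.roll` wraps border voxels around, which is a simplification of re-cropping shifted patches from the full volume.

```python
import numpy as np

def translate_augment(patch, shift=2):
    """Generate the 6 translated copies of a nodule patch (+/- `shift` voxels per axis).

    np.roll wraps values at the borders; re-cropping from the full CT volume
    would avoid this, at the cost of a less self-contained example.
    """
    copies = []
    for axis in range(3):
        for d in (-shift, shift):
            copies.append(np.roll(patch, d, axis=axis))
    return copies  # 6 augmented patches per input

print(len(translate_augment(np.zeros((32, 32, 32)))))  # 6
```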

To systematically evaluate the performance of the MCNN, we covered different network configurations, i.e. different numbers of convolutional kernels in each convolutional layer and of hidden neurons in the feature layer. The numbers of convolutional kernels were \(n_1=\{50, 100\}\) and \(n_2=\{50, 100\}\) for the first and second layers, and the number of hidden neurons in the feature layer was \(n_3=\{20, 50\}\), giving 8 configurations in total for the MCNN. We set \(\alpha _m = 0.001\) for all m, as found best in [12]. The convolutional kernel size is \(5\times 5\times k\), which is quite typical in traditional CNN, where k is the third dimension of the input feature map. The pooling size was fixed to a \(2\times 2\) window. We added L2-norm weight decay during training to relieve overfitting. Two classifiers were used in the experiments: SVM with a Radial Basis Function (RBF) kernel, and RF. The hyper-parameters of both SVM and RF were obtained via a grid search on the training set.
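The classifier grid search can be sketched with scikit-learn as below; the parameter grids and the dummy data are assumptions, since the paper does not specify the exact search ranges.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Hypothetical search grids; actual ranges are not stated in the paper.
svm_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
    cv=5,
)
rf_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
)

# Placeholder data standing in for MCNN feature vectors and malignancy labels.
X_train = np.random.randn(100, 150)
y_train = np.random.randint(0, 2, 100)
svm_search.fit(X_train, y_train)
print(svm_search.best_params_)
```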

We compared our results with two competing methods based on the HOG and LBP descriptors. For the HOG descriptor, we used different cell window sizes \(s_w=\{8, 16, 32\}\) with \(n_o=8\) orientations. For the LBP descriptor, the uniform LBP descriptor was computed with different numbers of neighbourhood points \(n_{pt}=\{8,16,24\}\). Both descriptors were computed on all three scales of nodule patches.
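The competing descriptors can be computed with scikit-image roughly as follows, assuming 2-D nodule slices as input; the slice choice and the LBP radius \(R = n_{pt}/8\) are assumptions not stated in the paper.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern

patch = np.random.rand(64, 64)  # placeholder for one 2-D nodule slice

# HOG with 8 orientations and a variable cell window size s_w.
for s_w in (8, 16, 32):
    desc = hog(patch, orientations=8, pixels_per_cell=(s_w, s_w),
               cells_per_block=(1, 1))
    print("HOG s_w=%d -> %d dims" % (s_w, desc.size))

# Uniform LBP with a variable number of neighbourhood points n_pt.
for n_pt in (8, 16, 24):
    codes = local_binary_pattern(patch, P=n_pt, R=n_pt / 8.0, method="uniform")
    # Uniform LBP yields P + 2 distinct codes; summarize them as a histogram.
    hist, _ = np.histogram(codes, bins=n_pt + 2, density=True)
    print("LBP n_pt=%d -> %d-bin histogram" % (n_pt, hist.size))
```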

Regarding time complexity, although training deep networks can be time-consuming, we adopt a strategy of off-line training and on-line testing: once the network has been trained off-line, testing a new sample is fast. In our study, using an NVIDIA Tesla K40 GPU, the test time for a single nodule patch was within 0.1 s. The CNN implementation used in this work was the deep learning toolkit CAFFE [9]; the SVM and RF classifiers were from the scikit-learn package [14]; and the HOG and uniform LBP descriptors were from the scikit-image package [18].

3.3 Binary Nodule Classification Results

In this section, we evaluate binary nodule classification. We used the average accuracy to measure classification performance, i.e. the average ratio of correctly classified nodules from both classes over the 5-fold cross validation. Note that in each test fold, the 176 benign and 99 malignant nodules yield a majority-class baseline accuracy of 64 %. As shown in Fig. 5, the proposed MCNN achieved competitive results above 84 % across the different configurations. The highest classification accuracy, 86.84 %, was obtained with the RF classifier under the configuration \(n_1=100\), \(n_2=100\), and \(n_3=50\) (see Fig. 2). The overall performance of both classifiers suggests that our method achieves promising results. This advantage can be ascribed to the hierarchical learning strategy, which selects high-level features and eliminates many redundant ones. As shown in [7], convolutional networks can perform useful dimensionality reduction that is very helpful for image-related classification.

Fig. 5. The classification performance of SVM with the RBF kernel and RF based on features from the MCNN using 8 different configurations. Each configuration is assigned a unique ID for display convenience.

Accuracies of the HOG and LBP descriptors are shown in Table 1 (numbers in bold denote the best results per column). The HOG descriptor was clearly quite sensitive to the cell window size (\(s_w\)): its results dropped, even below the baseline, as \(s_w\) grew, indicating that little information is gained at larger window sizes. For the LBP descriptor, the number of neighbourhood points (\(n_{pt}\)) was positively related to performance, since more sophisticated neighbourhood structures led to better results. Comparing the best results of each approach, our MCNN outperformed the HOG and LBP descriptors by 10.91 % and 13.17 %, respectively. Overall, these observations confirm that parametric textural descriptors are sensitive to their parameters.

Table 1. Performance using the HOG and LBP descriptors with different \(s_w\) and \(n_{pt}\).

3.4 Robustness to Noise Corruption

We further trained and tested the proposed MCNN on challenging noisy data. In particular, we imposed Gaussian noise on the original CT data at different levels, with mean \(\mu =0\) and standard deviations \(\sigma =\{0.5, 1.0, 2.0\}\) (as shown in Fig. 6). As seen in Table 2, the MCNN still achieved 83.56 % and 83.27 % with \(\sigma = 2.0\), indicating its robustness against noisy inputs. This success can probably be explained by the max-pooling layers, whose selective downsampling strategy "filters out" noisy outliers, rendering the network robust to corrupted information. The performance thus reaffirms that the MCNN are capable of finding patterns inherently associated with the different nodule classes.
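The noise injection can be sketched as below, assuming additive Gaussian noise on (normalized) CT intensities; `add_gaussian_noise` is an illustrative helper, not the paper's exact procedure.

```python
import numpy as np

def add_gaussian_noise(volume, sigma, mu=0.0, seed=None):
    """Impose additive Gaussian noise N(mu, sigma^2) on a CT volume.

    Assumes intensities are normalized, so sigma = 0.5, 1.0, 2.0 are
    relative to that normalized scale.
    """
    rng = np.random.default_rng(seed)
    return volume + rng.normal(mu, sigma, size=volume.shape)

noisy = add_gaussian_noise(np.zeros((32, 32, 32)), sigma=2.0, seed=0)
print(noisy.std())  # roughly 2.0
```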

Fig. 6. Slice demonstration with different levels of noise. Gaussian noise with mean \(\mu =0\) and different standard deviations was imposed.

Table 2. Performance of the MCNN under different levels of noise.

3.5 Discussion

We have shown promising results of the proposed MCNN framework for classifying diagnostic nodule classes. Convolutional networks are a powerful tool for image analysis because their capacity can be easily adjusted to a specific task and they make strong and mostly correct assumptions about the nature of images [11]. Although our study suggests a clear need to further investigate which nodule scales lead to improved performance, we found experimentally that even a single scale remained competitive (\( CNN _{ 0 }\), \( CNN _{ 1 }\), and \( CNN _{ 2 }\) achieved 86.12 %, 83.88 %, and 79.00 %, respectively). However, the multi-scale strategy eliminates the careful design of patch sizes, which could be tedious. Rather than using isotropic kernels, we kept the third dimension of each kernel equal to the channel dimension of the input feature map, which is quite common in the intermediate layers of conventional CNN; this also enabled us to use CAFFE directly out of the box.

4 Conclusion

In this paper, we proposed a Multi-scale Convolutional Neural Networks (MCNN) architecture to tackle the challenging problem of learning from lung nodule patches for nodule diagnostic classification. We demonstrated that the learned compact features can capture nodule heterogeneity. It is particularly promising that the MCNN model is robust against noisy inputs, which is valuable in the medical image analysis field. Extensive experiments showed that our method achieved 86.84 % accuracy for nodule classification, outperforming competing benchmark textural descriptors. In future work, we plan to expand data inclusion for a large-scale evaluation and to investigate appropriate scales for improving image-guided nodule analysis.