1 Introduction

Lung nodule characterization is the most difficult step in the lung cancer diagnosis pipeline according to radiologists, as evidenced by the great inter-observer disagreement on the task [7, 12]. A lung nodule is typically characterized with respect to texture, spiculation, lobulation, and its overall morphological appearance on a CT scan, and must eventually be classified as either benign or malignant for patient management. The Lung Imaging Reporting and Data System (Lung-RADS) [9] is a protocol that defines explicit guidelines for nodule management and follow-up planning, and classifies pulmonary nodules into six categories, each of which has its own suggested follow-up. Lung-RADS also integrates the PanCan model [11], which provides a malignancy probability based on the morphology of a nodule and additional patient information. A definitive diagnosis can only be made through biopsy, which, however, is invasive and not always feasible. While determining the malignancy of a nodule from its appearance on a CT scan is not a fail-proof method, it remains a very useful step in the lung cancer detection pipeline and can be of great value to clinicians in conjunction with patient history and demographics.

Several deep learning methods have been proposed for automated nodule classification from CT. The publicly available Lung Image Database Consortium and Image Database Resource Initiative (LIDC) database [2, 10] has been at the core of the majority of such efforts. The LIDC does not primarily contain pathology-confirmed ground truths (besides a very small subset of cases), but rather radiologists' annotations. Nevertheless, it is still heavily used by the research community for the task of lung nodule classification. Interestingly, there are various design choices regarding sample selection that need to be considered, which can have a severe impact on the reported results.

The contributions of this paper can be summarized as follows: 1) We analyze several published works reporting results on LIDC nodule classification and examine their different assumptions, such as annotation aggregation methods, removal of cases based on clinical guidelines, and data augmentation, all of which affect the resulting sample selection; 2) Through an extensive experimental analysis, we show that the selected data distribution can affect the difficulty of the task and may play an even more important role than the model architecture; 3) We demonstrate that reproducibility and direct model comparison are virtually impossible to achieve and provide suggestions towards making them feasible in future work, while also making our data selection publicly available to promote reproducibility. We illustrate the pitfalls of sample selection with a novel methodological approach of curriculum by smoothing for lung nodule classification. Our findings and insights will be of use to the community and aid the design of future approaches for lung nodule classification.

2 State-of-the-Art in Lung Nodule Classification

The LIDC dataset contains more than 1000 scans. Each scan was reviewed by four radiologists, who pinpointed lesion locations and assigned a variety of annotations, including malignancy. For every nodule, each radiologist assigned a malignancy rating from 1 (most likely benign) to 5 (most likely malignant); nodules rated 3 were regarded as indeterminate.

Fig. 1. Lung nodule examples from the LIDC. Top row: Nodules that have the same consensus regardless of the aggregation method used. Bottom row: Nodules that have different consensus depending on the aggregation method.

There are a number of preprocessing and data curation steps which are considered fixed when using the LIDC, and almost all recent deep learning papers follow them. These include (1) retaining only nodules that have been annotated by at least three radiologists and (2) discarding nodules annotated as indeterminate. Subsequently, for each nodule a consensus annotation is extracted from the individual annotations through some form of aggregation or truthing (typically using the mean, the median, or majority voting). Example nodules from the LIDC with different consensus/aggregation combinations can be seen in Fig. 1. Given these relatively straightforward steps, it may be surprising to find that every paper we studied reports largely varying numbers of benign and malignant nodules and overall cases (see Table 1). Most studies report that they follow a procedure similar to previous work; however, they rarely provide the exact details of either the sample selection process or the final dataset (e.g. by publishing a list of scan series IDs). Besides the differences in absolute numbers of benign and malignant cases, the characteristics of the underlying data distribution may change significantly. One of the most important characteristics is the size of a nodule (quantified by its diameter), as it plays an essential role in malignancy classification. Another discrepancy arises from the decision to remove cases that have a slice thickness >2.5 mm, which is based on clinical guidelines [5]. Images with thick slices are deemed unsuitable for lung cancer screening. This step was first suggested in the LUNA16 nodule detection challenge [15] and has also been adopted by other studies [20]. One of the few works that release their pre-processed data is by Al-Shabi et al. [1].
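
To make the truthing step concrete, the following minimal Python sketch (the helper name and example rating lists are our own, not taken from any released pipeline) illustrates how mean and median aggregation of the radiologist ratings can yield different consensus labels, and why median aggregation discards more nodules as indeterminate:

```python
import statistics

def aggregate_label(ratings, method="median"):
    """Return 'benign', 'malignant', or None (indeterminate -> nodule discarded)."""
    score = statistics.median(ratings) if method == "median" else statistics.mean(ratings)
    if score < 3:
        return "benign"
    if score > 3:
        return "malignant"
    return None  # consensus of exactly 3 -> indeterminate

for ratings in ([1, 2, 3, 5], [3, 3, 3, 5], [2, 3, 3, 4]):
    print(ratings,
          "mean ->", aggregate_label(ratings, "mean"),
          "| median ->", aggregate_label(ratings, "median"))
# [1, 2, 3, 5]: mean -> benign     | median -> benign
# [3, 3, 3, 5]: mean -> malignant  | median -> None (excluded)
# [2, 3, 3, 4]: mean -> None       | median -> None (excluded)
```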

Here, we attempt to draw a direct comparison between their work and the dataset we have extracted by pre-processing the LIDC (see Fig. 2). Such a comparison is not feasible for the other proposed methods, which do not publicly release their sample selection. In this comparison, we want to highlight the important role that the aggregation method (mean vs median) plays in determining which samples are labeled as benign and malignant. When median aggregation is used, many more nodules have an indeterminate consensus (i.e. median=3) and are therefore excluded, resulting in a smaller, more balanced dataset, which is much easier to separate based on the key characteristic of nodule diameter. Specifically, median aggregation leads to 442/406 benign/malignant nodules for [1] and 376/357 benign/malignant in our replicated pipeline, respectively. In contrast, mean aggregation results in 653/484 benign/malignant for [1] and 559/451 for us. A further factor leading to a discrepancy between the two samples, even when the same aggregation method is used, is that cases with a slice thickness >2.5 mm were retained by [1]. These factors make reproducibility and direct comparison of methods nearly impossible.

Table 1. Overview of previous work on lung nodule classification on LIDC-IDRI in terms of nodule counts and performance. Despite all papers using the same publicly available dataset, the final numbers of benign and malignant cases vary largely, making a direct comparison of the methods' performance impossible.

Fig. 2. Data distributions of benign and malignant samples over nodule diameter. (a) Median aggregation from [1], (b) Mean aggregation from [1], (c) Our median aggregation, (d) Our mean aggregation. Median aggregation produces fewer nodules in total (i.e. more nodules are classified as indeterminate) in both cases, and at the same time more balanced datasets.

3 Methodology

Here we present the different methods and approaches, including our attempted methodological contribution, that we considered for studying the impact of sample selection on lung nodule classification performance. We use several baselines as well as state-of-the-art deep learning approaches.

3.1 Diameter-Based Baselines

Diameter Threshold. The first baseline we set is not learning-based but rather a simplistic one. Specifically, given that the size of a nodule is a primary factor in determining whether it is malignant (i.e. large nodules are most likely to be regarded by experts as malignant, while small ones as benign), we use the diameter annotation provided in the LIDC and select a threshold for classifying nodules into benign and malignant. This baseline serves as a surrogate for the difficulty of the classification task, as the overall size difference between structures may be easily picked up by an image-based prediction model such as a convolutional neural network (CNN).
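
A minimal sketch of this baseline is shown below, assuming the training diameters and binary labels are available as NumPy arrays (the variable and function names are illustrative); the chosen threshold is simply the candidate value that maximizes training accuracy:

```python
import numpy as np

def fit_diameter_threshold(diameters_mm, labels):
    """diameters_mm: nodule diameters; labels: 1 = malignant, 0 = benign.
    Returns the threshold with the highest training accuracy."""
    best_t, best_acc = None, 0.0
    for t in np.unique(diameters_mm):
        pred = (diameters_mm >= t).astype(int)  # nodules at/above the threshold -> malignant
        acc = float((pred == labels).mean())
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Usage (hypothetical arrays): threshold, train_acc = fit_diameter_threshold(train_diam, train_y)
```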

Regressed Diameter Threshold. The second baseline is similar to the previous one, but uses a CNN trained to regress the nodule diameter with a mean squared error loss. Classification is then performed by applying the threshold determined for the first baseline to the output of the CNN instead of the annotation. Again, if this baseline works well, one may conclude that the task on a specific dataset is not very difficult.

3.2 ShallowNet

We also implement a CNN for malignancy classification (termed ShallowNet), which is a bare-bones CNN comprising four convolutional layers with 3x3 kernels and ReLU activations, each followed by a 2x2 max-pooling layer, and a fully-connected layer with 1024 neurons at the end for the classification. This is a deliberately simplistic deep learning baseline used for comparison with the more complicated architectures proposed in the literature.
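
The following PyTorch sketch reflects our reading of this description; the channel widths and the two-layer classification head are assumptions, since only the kernel sizes, the number of stages, and the 1024-unit fully-connected layer are fixed above:

```python
import torch
import torch.nn as nn

class ShallowNet(nn.Module):
    def __init__(self, in_channels=1, widths=(32, 64, 128, 256), n_outputs=1):
        super().__init__()
        layers, c_in = [], in_channels
        for c_out in widths:
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            c_in = c_out
        self.features = nn.Sequential(*layers)
        # 32x32 input patches are reduced to 2x2 after four 2x2 poolings.
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(widths[-1] * 2 * 2, 1024),
                                  nn.ReLU(inplace=True),
                                  nn.Linear(1024, n_outputs))

    def forward(self, x):
        return self.head(self.features(x))

# The same backbone can be trained for malignancy classification (binary
# cross-entropy on the single logit) or reused for the regressed-diameter
# baseline of Sect. 3.1 by switching to a mean squared error loss.
```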

3.3 Local-Global

Since we have access to the sample selection of [1], it makes sense to also use the state-of-the-art method on this distribution. The Local-Global network, proposed by [1], consists of two blocks. Each block contains the following sequence: a residual sub-block [4], followed by a non-local sub-block [19] and a dropout layer. After the two blocks, an average-pooling layer and a fully-connected layer perform the classification.
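
A rough 2D sketch of one Local-Global block is given below. It follows the sequence described above (residual sub-block, embedded-Gaussian non-local sub-block, dropout), but the channel counts, dropout rate, and internal layer choices are our assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block [19] capturing global context."""
    def __init__(self, channels):
        super().__init__()
        self.inter = max(channels // 2, 1)
        self.theta = nn.Conv2d(channels, self.inter, 1)
        self.phi = nn.Conv2d(channels, self.inter, 1)
        self.g = nn.Conv2d(channels, self.inter, 1)
        self.out = nn.Conv2d(self.inter, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)        # B x HW x C'
        k = self.phi(x).flatten(2)                           # B x C' x HW
        v = self.g(x).flatten(2).transpose(1, 2)             # B x HW x C'
        attn = F.softmax(q @ k, dim=-1)                      # pairwise similarities
        y = (attn @ v).transpose(1, 2).reshape(b, self.inter, h, w)
        return x + self.out(y)                               # residual connection

class ResidualBlock(nn.Module):
    """Plain two-convolution residual sub-block [4] for local features."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))

class LocalGlobalBlock(nn.Module):
    """Residual sub-block -> non-local sub-block -> dropout, as described above."""
    def __init__(self, channels, p_drop=0.5):
        super().__init__()
        self.body = nn.Sequential(ResidualBlock(channels),
                                  NonLocalBlock(channels),
                                  nn.Dropout2d(p_drop))

    def forward(self, x):
        return self.body(x)
```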

3.4 Curriculum by Smoothing

Finally, we propose the use of curriculum by smoothing (CBS) [18], which has shown promising results on computer vision classification tasks. CBS plays the role of our attempted methodological contribution on lung nodule classification. The main idea behind CBS is to apply a Gaussian smoothing kernel to the output of each convolutional layer of a CNN. We use \(\theta \circledast x\) to denote the convolution of a kernel \(\theta \) with an input x. Typically, in a CNN, a convolution operation is followed by a non-linear activation function as described in Eq. 1:

$$\begin{aligned} z = \textit{activation}(\theta _{\omega } \circledast x) \end{aligned}$$
(1)

where \(\theta _{\omega }\) are the trainable parameters of a convolutional layer. The CBS formulation is presented in Eq. 2:

$$\begin{aligned} z = \textit{activation}(\theta _{G} \circledast (\theta _{\omega } \circledast x)) \end{aligned}$$
(2)

where \(\theta _{G}\) is a predefined Gaussian kernel. The Gaussian kernel is deterministic and is not trained. During the early stages of training it has an initial standard deviation \(\sigma \), which is annealed as training progresses. This way, high-frequency information is suppressed in the early training steps of the CNN and is only considered at later stages of the training process.
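
A minimal sketch of how CBS can be realized in practice is shown below: a fixed Gaussian kernel is applied depthwise to the output of a convolutional layer before the activation (Eq. 2), and its standard deviation is reduced by calling an annealing method at fixed intervals during training. The wrapper class and its names are illustrative; kernel size and initial \(\sigma \) follow the settings reported in Sect. 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kernel2d(sigma, size=3):
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

class CBSConv2d(nn.Module):
    """Convolution whose output is smoothed by a fixed, annealed Gaussian kernel."""
    def __init__(self, in_ch, out_ch, kernel_size=3, sigma=1.0, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.sigma = sigma  # not a trainable parameter

    def anneal(self, step=0.5, min_sigma=1e-3):
        """Call every few epochs to decay the smoothing (0.5 every 5 epochs in Sect. 4)."""
        self.sigma = max(self.sigma - step, min_sigma)

    def forward(self, x):
        z = self.conv(x)                                          # theta_w * x
        g = gaussian_kernel2d(self.sigma).to(z)                   # theta_G in Eq. 2
        g = g.view(1, 1, 3, 3).repeat(z.shape[1], 1, 1, 1)        # one kernel per channel
        z = F.conv2d(z, g, padding=1, groups=z.shape[1])          # depthwise smoothing
        return F.relu(z)                                          # activation(theta_G * (theta_w * x))
```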

It is important to note that while we introduce CBS here as an approach that could enhance the performance of ShallowNet or Local-Global for the task of lung nodule classification, our purpose is not to propose a novel model architecture but rather to explore whether the selected sample distribution can play a more important role than the model architecture and highlight the pitfalls that occur in such a scenario.

4 Experimental Analysis

Following on from the differences in the data distributions, we now compare several baseline models, as well as the method proposed in [1]. In this section we focus on two distributions to demonstrate the impact of sample selection and to understand whether performance differences stem from the data or the methods. Specifically, we use the data produced with median aggregation (Fig. 2a) from [1] (henceforth denoted as \(\mathcal {D}_{1}\)), as this is the one the authors report results for, and mean aggregation (Fig. 2d) for our data (denoted as \(\mathcal {D}_{2}\)). We do not consider mean aggregation to be superior to median; instead, we want to study the differences in performance caused by this specific choice of truthing. Median aggregation leads to the two classes being more easily separated based on nodule diameter (Figs. 2a, 2c), even though 5–10 mm is considered the most difficult range in which to separate malignant from benign nodules. In both \(\mathcal {D}_{1}\) and \(\mathcal {D}_{2}\), a nodule is considered benign when the consensus has a value lower than 3 and malignant when it has a value greater than 3.

CT scans with a slice thickness greater than 2.5 mm are removed according to clinical guidelines [5], every remaining scan is resampled to 1 mm isotropic resolution across all three dimensions, and one 32x32 mm patch is extracted along each orthogonal plane at each nodule location. The final classification for each nodule is obtained by averaging the individual classifications of its three planes. Some experiments include offline data augmentation (i.e. the size of the dataset itself is increased six-fold through the addition of nodule augmentations); these augmentations are the ones suggested by [1] and include rotations, horizontal flips and Gaussian smoothing. For the proposed methodological contribution of employing CBS, we choose 3x3 kernels, an initial standard deviation \(\sigma =1\) for the Gaussian smoothing kernel, and an annealing step of 0.5 every 5 epochs, based on the guidelines provided by the authors of [18] and our own validation performance. All models are evaluated using 10-fold cross validation and the reported results are the average performance across the 10 folds. The networks are trained using the Adam optimizer [6] with a learning rate of \(10^{-3}\) and a binary cross-entropy loss for 50 epochs with a batch size of 256 samples. We also employ early stopping to avoid overfitting. All experiments were conducted using PyTorch [13].
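
As an illustration of the per-nodule prediction rule (the function and argument names below are ours), the three orthogonal patches of a nodule are scored independently by the trained network and their sigmoid outputs are averaged:

```python
import torch

def nodule_probability(model, axial, coronal, sagittal):
    """Each argument is a 1x1x32x32 patch tensor of the same nodule;
    returns the averaged malignancy probability over the three planes."""
    model.eval()
    with torch.no_grad():
        patches = torch.cat([axial, coronal, sagittal], dim=0)   # 3x1x32x32
        probs = torch.sigmoid(model(patches)).squeeze(1)          # one score per plane
    return probs.mean().item()                                    # average over the 3 planes
```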

The results of the comparison can be found in Table 2. First, we show that even separating the samples based on a nodule diameter threshold achieves quite high accuracy (85.02% for \(\mathcal {D}_{1}\) and 83.46% for \(\mathcal {D}_{2}\)). In each case, we select the threshold that maximizes training accuracy. The threshold differs considerably between the two cases (7.2 mm for \(\mathcal {D}_{1}\) and 11.5 mm for \(\mathcal {D}_{2}\)) because of the different aggregation methods used, and also because [1] uses the equivalent diameter (i.e. the diameter of the sphere having the same volume as the estimated nodule volume). We then use a shallow CNN (ShallowNet) to regress the nodule diameter and apply a threshold to its output (7.7 mm for \(\mathcal {D}_{1}\) and 11 mm for \(\mathcal {D}_{2}\)) in order to classify the nodule. Focusing on \(\mathcal {D}_{1}\), we see that a ShallowNet trained directly on malignancy initially only just outperforms the diameter-based baselines (85.74%), but its performance improves progressively when we use either CBS (86.80%) or offline augmentations (89.74%), reaching 90.91% when both are used. We observe the same pattern for Local-Global [1], which starts at 89.15% without CBS or augmentations and eventually reaches 90.91% with both. The progressive gains from CBS and augmentations present on \(\mathcal {D}_{1}\), however, are not replicated on \(\mathcal {D}_{2}\). In that case all methods perform very similarly to the diameter-based baselines, with ShallowNet being the only one to surpass them marginally in terms of accuracy (84.35% with augmentations).

Table 2. Comparison of methods on the different data distribution settings. The reported results are averaged across the 10 folds. \(\mathcal {D}_{1}\) is the data distribution used in [1], which results from median aggregation, while \(\mathcal {D}_{2}\) has been extracted from the LIDC by us using mean aggregation. We use accuracy (Acc), sensitivity (Sens) and specificity (Spec) to report the performance of each method, and all reported values are percentages (%). Even from the baselines, it is evident that \(\mathcal {D}_{1}\) poses an easier task than \(\mathcal {D}_{2}\). All methods perform better when augmented with CBS on \(\mathcal {D}_{1}\). On \(\mathcal {D}_{2}\), all configurations perform similarly to the diameter baseline, and there is no improvement from progressively increasing the complexity of the model by adding augmentations and/or CBS.

5 Discussion

The LIDC dataset has been instrumental to the majority of recent works on lung nodule classification. Here, we take a critical look at the aspect of sample selection after discovering inconsistencies in the reported literature. We aimed to examine the different factors that affect the performance of a model and thus the apparent value of its methodological contribution. Starting from the pre-processing steps that various studies have applied to the LIDC dataset, we observe that a number of different assumptions during the sample selection process can lead to very different resulting data distributions (Table 1). Such factors are the choice of the aggregation method (e.g. median or mean) used to extract a consensus from the multiple annotations per nodule, or the removal of certain cases which are considered unsuitable for the task due to clinical guidelines.

The aggregation method, in particular, plays a very important role. First, it affects the total number of nodules that are retained, since median aggregation leads to more nodules having an indeterminate consensus, and consequently being removed, compared to mean aggregation. It is fair to say that these nodules, which are retained in the dataset under mean aggregation, are harder examples, and therefore the classification task that results from mean aggregation is more difficult. Second, the prevalence of the two classes in the dataset changes substantially, since median aggregation leads to a more balanced, and potentially more favorable for classification, dataset.

It is easy to see that these choices change the nature of the underlying data distribution and hence of the classification task itself. Comparing the performance of different methods applied to different distributions is thus complex, and makes the objective assessment of the value of methodological contributions difficult, which we also demonstrate experimentally. We initially devise several baselines. The first is a simple thresholding of the annotated nodule diameter. A size-related annotation is usually a core part of a lung nodule dataset, including the LIDC, and therefore this baseline is applicable in all future studies. In the second baseline we apply a threshold to diameter predictions regressed by a neural network. This can indicate the degree of bias that a neural network has towards associating large nodules with malignancy and small ones with a benign nature. Given the very similar performance of the ShallowNet trained on malignancy prediction itself and the ShallowNet trained to regress the diameter, we conclude that this bias is actually quite severe. It is well documented [11] that the size of a nodule is an important factor in determining whether it is benign, but from a clinical perspective there are also other indications, such as texture or spiculation, which do not seem to be picked up by the neural network. The aforementioned baselines can describe the difficulty of the task, and we suggest their adoption by the research community working on lung nodule classification. Additionally, we intend to publicly release our sample selection, and we urge the research community to do the same to promote reproducibility.

The core argument of our paper is epitomized when we compare the performance of all methods on the two distributions. Overall, we see that on \(\mathcal {D}_{1}\), adding data augmentation or increasing the complexity of the model (i.e. Local-Global instead of ShallowNet) consistently leads to a distinct increase in performance. Using CBS during training results in a performance increase for every single method, marginally outperforming even the state of the art (Local-Global w/ augmentations) on \(\mathcal {D}_{1}\). However, on \(\mathcal {D}_{2}\), all methods are bounded by the diameter-threshold baseline, and even CBS does not have the impact it did on \(\mathcal {D}_{1}\). This highlights the pitfalls of sample selection, which may lead to incorrect conclusions about methodological contributions. If we were to report results only on \(\mathcal {D}_{1}\), we might have concluded that CBS is beneficial for lung nodule classification and even outperforms previous works.

6 Conclusion

In this paper we have investigated the effect of sample selection in the context of lung nodule classification using deep learning. We have examined the different factors that cause the various published studies to report completely different numbers of nodules, and we have shown experimentally that these factors directly affect network performance. We have demonstrated that using progressively more complex methods systematically improves performance on the task if, and only if, the assumptions behind the data selection process allow for it. On the other hand, if the data distribution presents a more challenging classification task, as is the case when mean aggregation of the nodule annotations is used, then neither model complexity nor data augmentation offers any performance boost over even the simplest baseline.