Introduction

Compelling clinical studies have shown a benefit of lung cancer screening, which allows for the early diagnosis and treatment of lung cancer. A critical issue is the diagnosis of a pulmonary nodule as benign or malignant. Current lung cancer screening practice is to identify pulmonary nodules on annual low-dose CT scans and to apply a follow-up procedure, such as another CT scan or a fine needle biopsy, to suspicious nodules to determine their malignancy status. We consider here how that malignancy status may be determined from just the initial CT image of the nodule.

Pulmonary nodule size

It has been universally recognized that the probability of malignancy is correlated with the size of a pulmonary nodule; nodule size is always noted in radiological reports, and size is used in radiological staging and for determining follow-up in lung cancer screening. The conventional way of recording size is to make a single “diameter” measurement, or the average of two such measurements, across a central image slice through the nodule, expressed in mm. More recently, with the advent of volumetric measurement methods, size is represented by the volume of the nodule, expressed in \(\hbox {mm}^{3}\). While the former is more conventional and understandable to most physicians, the latter directly relates to the amount of information (number of pixels) available in the CT image. In this paper, we will use both measures; in discussions, we will relate the equivalent diameter D of a nodule with volume V by:

$$\begin{aligned} D=\left( {\frac{6}{\pi }V} \right) ^{\frac{1}{3}} \end{aligned}$$

Therefore, in the context of this paper, diameter refers to a surrogate for the measured volume of the nodule and it does not correspond to any actual single-dimensional measurement made on the nodule image.

Further, the size range of nodules under consideration for a classifier is important for nodule classification. We specify the size range R as the ratio of the largest to smallest volume of nodules in a dataset.

$$\begin{aligned} R=\frac{V_\mathrm{{largest}} }{V_\mathrm{{smallest}} } \end{aligned}$$
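Both size measures are simple to compute; the following Python sketch (the function names are ours) implements the equivalent diameter D and the size range R defined above:

```python
import math

def equivalent_diameter(volume_mm3: float) -> float:
    """Diameter (mm) of a sphere whose volume equals the nodule's volume."""
    return (6.0 / math.pi * volume_mm3) ** (1.0 / 3.0)

def size_range(v_largest_mm3: float, v_smallest_mm3: float) -> float:
    """Size range R of a dataset: ratio of largest to smallest nodule volume."""
    return v_largest_mm3 / v_smallest_mm3
```

For example, a 10 mm sphere has volume \(\pi /6 \cdot 10^{3} \approx 523.6\,\hbox {mm}^{3}\), and `equivalent_diameter(523.6)` recovers approximately 10 mm.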

Nodule size in lung cancer screening

In lung cancer screening, the objective is to identify cancers at the earliest stage; that is, when they are smallest in size, they are the most curable and simpler treatment options may be available. Small nodules have less image information in CT images than large nodules because of the number of fixed-size image elements (pixels) that they span. Further, in screening, CT scans are acquired at a low dose since the primary task is detection; therefore, the image noise is much higher than for regular CT scans. For a 1-mm-thick-slice whole-chest CT scan using the conventional 512 \(\times \) 512 image size, the volume of each pixel is on the order of \(0.5\,\hbox { mm}^{3}\); therefore, a very rough estimate of the number of pixels in a nodule image is twice its volume in \(\hbox {mm}^{3}\). While nodules may be visible to a physician at an apparent 1–2 mm size, the image information is limited. For example, a 2 mm nodule spans on the order of 8 pixels, a 3 mm nodule 27 pixels, a 4 mm nodule 64 pixels and a 5 mm nodule 125 pixels; further, in all these cases, a large majority of these pixels are partial pixels; that is, they consist of a mixture of the nodule tissue and the surrounding lung tissue.
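These pixel-count estimates follow directly from the assumed \(0.5\,\hbox {mm}^{3}\) voxel volume; a minimal Python sketch:

```python
import math

VOXEL_MM3 = 0.5  # assumed voxel volume: 1 mm slice, 512 x 512 whole-chest CT

def approx_pixel_count(diameter_mm: float) -> float:
    """Rough pixel count for a spherical nodule: sphere volume divided by
    the voxel volume, i.e., about twice the volume in mm^3."""
    return (math.pi / 6.0) * diameter_mm ** 3 / VOXEL_MM3
```

This gives about 8 pixels at 2 mm and about 131 at 5 mm, matching the cube-of-diameter figures quoted above to within the crudeness of the estimate.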

The larger size limit of interest is 15–20 mm. Nodules larger than this generally have a high probability of malignancy and occur very infrequently in the repeat rounds of screening. Such nodules may be detected in the first baseline screening but should not occur in repeat rounds if appropriate small-nodule follow-up procedures are correctly followed. At the large end of the size range, we have the most image information: a 15–20 mm nodule image has on the order of \(10^{3}\)–\(10^{4}\) pixels. However, this upper end of the size range is much less clinically interesting, since we aspire to identify cancers at an earlier stage, when they are much smaller in size.

An alternative to characterizing a single CT image is to measure the nodule growth rate from two or more images [1]. However, CT scanners do not currently provide the volumetric calibration this approach requires, and it also delays diagnosis, since a measurable change must occur in the nodule between scans.

Size bias in feature evaluation

Obviously, size is a very important image characteristic for determining the probability of malignancy. In this paper, we explore image features other than size in order to provide an improved probability estimate. Since size is easily determined, the main question of interest is the probability of cancer at a given size rather than the probability of cancer with respect to the distribution of sizes.

A major issue in exploring pulmonary nodule characterization is to acquire a large enough sample of both malignant and benign pulmonary nodule images with known outcomes. It is tempting to use all possible data, but the danger here is that the size distribution for the benign nodules may have a much smaller mean than the size distribution for the malignant nodules. The results of the evaluation then reflect the natural difference in size distribution of the datasets rather than other characteristics of the images.

Our hypothesis is that the size distribution difference may become the largest factor in the performance evaluation of datasets with different distributions. We test this hypothesis in two ways: first, we evaluate a size-based classifier that uses size as the only feature on which to predict malignancy, and second, we construct datasets with balanced size distributions. We compare the results of the size classifier and the balanced datasets to the outcome of the traditional size-blind approach [2–11]. We also consider the impact of training using size binning, that is, using a set of size-specific classifiers instead of a single size-independent classifier.

Image features

The general approach for computer-aided classification as applied to malignancy diagnosis is to first establish a dataset of images with known outcomes from both classes. A large number of image features (often termed texture measures) are computed for all images in the dataset, and a subset of the features with the best diagnostic performance are selected for the final classifier. In traditional computer vision for conventional video images, there are a number of “texture” features that are classically used. We placed less importance on these features given the ways that the CT data differs from conventional images; for example, (a) the small number of pixels in a nodule, (b) the large amount of image noise and (c) CT images are 3D and have calibrated pixel values. We included nontraditional image features for evaluation including 3D geometry features, 3D features of the density distribution, surface curvature features and features of the nodule margin.

In our preliminary studies [12], we showed that test set size distribution imbalance had a major impact on the perceived outcomes of other studies [2, 3] and that size balancing diminished the ROC AUC. Related work [13] showed that 3D image features based on all the image pixels of a nodule were more effective than 2D image features based on just the central image slice of the nodule.

In this paper, the issues in evaluating nodule characterization by image features in the context of lung cancer screening are explored with a system that includes novel 3D image features. Balanced and unbalanced evaluation datasets are used to determine the impact of size balancing and size binning.

Methods

Dataset selection

We combined image data from the two largest lung cancer screening studies, the Early Lung Cancer Action Program (ELCAP) [14] and the National Lung Screening Trial (NLST) [15]. Malignant nodules were included if there was a pathologically proven cancer diagnosis; benign nodules were included if there was 2 years of no clinical change or a benign pathologic diagnosis.

Pulmonary nodules may be solid, part-solid or nonsolid. Solid nodules are the most common type for cancer and consist of a mass of invasive cells that typically have a CT image intensity similar to that of soft tissue. Nonsolid nodules typically have abnormal cells distributed on the epithelial surface of the airways. Hence, the associated lung parenchyma has a higher CT image intensity than normal lung parenchyma but is less dense than soft tissue or solid nodules. Little is known about nonsolid nodules compared to the more typical solid nodules. One lung cancer screening study reported that 17 % of the cancers were nonsolid nodules [16]. It has been suggested that part-solid nodules, which contain both solid and nonsolid components, may occur when the cancer becomes invasive and a more traditional solid nodule is developing. From an image analysis viewpoint, nonsolid nodules have a very different visual presentation compared to solid nodules and are more challenging for image segmentation. Clinically, nonsolid nodules are considered to be slower growing than solid nodules and also harder to measure; screening protocols usually have a different management strategy for these nodules.

Given the different visual presentation of nonsolid nodules and their small numbers in our databases, only solid nodules or the solid component of part-solid nodules were included in this study. Nonsolid nodules will be considered in a future study when more images are available; a separate image analysis system for the nonsolid subtype will likely produce the best analysis outcomes.

The first of our study’s two datasets contained cases selected from the Weill Cornell Medical Center database (which is part of the ELCAP study) that had at least one solid or part-solid nodule on at least one thin-slice CT scan. Part-solid nodules were included only if they were comprised primarily of a solid component. The status of malignant nodules was determined by either biopsy or resection, while the status of benign nodules was established through a negative biopsy result or by 2 years of no clinical change as assessed by a board-certified radiologist. All CT scans had a slice thickness of 2.5 mm or less. Metastatic cancers and benign calcified nodules were excluded. A total of 259 nodules (167 malignant and 92 benign) with CT scans of 1.0, 1.25 or 2.5 mm slice thickness were included: approximately 13.9 % (36/259) of the nodules were on 1.0 mm scans, 73.8 % (191/259) on 1.25 mm scans and 12.4 % (32/259) on 2.5 mm scans. Scans were obtained using GE Medical Systems scanners. The Weill Cornell image acquisition time period was 1994–2007, and the majority of the Weill Cornell instances were reconstructed using the BONE kernel.

The second dataset contained cases selected from NLST. Participants underwent three rounds of screening at 1-year intervals. Cancers were identified through the NLST protocol. After three rounds, abnormalities suspicious for lung cancer that were stable across the three rounds were classified as minor abnormalities (i.e., benign). We selected NLST CT scans with a slice thickness less than or equal to 3.2 mm. A total of 477 nodules (245 malignant and 232 benign) with CT scans of 1.0, 1.25, 1.3, 2.0, 2.5, 3.0 and 3.2 mm slice thickness were chosen. Approximately 2.94 % (14/477) of the nodules were on 1.0 mm scans, 2.73 % (13/477) on 1.25 mm scans, 0.63 % (3/477) on 1.3 mm scans, 37.74 % (180/477) on 2.0 mm scans, 48.01 % (229/477) on 2.5 mm scans, 0.21 % (1/477) on 3.0 mm scans and 7.76 % (37/477) on 3.2 mm scans. Scans were obtained using a wide range of scanners including Siemens, GE Medical Systems, Philips and Toshiba scanners. For NLST, the screening time period was 2002–2007; NLST images were reconstructed with a variety of reconstruction kernels including STANDARD and BONE (for GE scanners) and B30f and B50f (for Siemens scanners).

Nodules were selected to meet the 3D feature image quality criterion, that is, that they spanned at least three image slices and preferably four or more. Further, all nodules had a diameter between 3 and 30 mm. The volume of each nodule was computed from automated segmentation [17], and the nodule size was represented as the diameter of a sphere with the same volume as the nodule. Only one instance of a nodule was used per case.

For both datasets, we used methods to minimize the size distribution differences between malignant and benign nodules. For the ELCAP dataset, we selected all the large benign nodules that were available to match the sizes of the cancers. For the NLST dataset, we sought to minimize the size of the cancers by selecting the first CT image in a longitudinal sequence where possible.

Figures 1 and 2 show the nodule size distribution for the Weill Cornell and NLST dataset. By combining these two datasets, we created a database with 736 nodules (412 malignant and 324 benign). Figure 3 shows the size distribution for the entire database, and Table 1 gives the statistics for size distribution for malignant and benign nodules.

Fig. 1
figure 1

Weill Cornell nodule subset size distribution

Fig. 2
figure 2

NLST nodule subset size distribution

Fig. 3
figure 3

Full dataset size distribution

Table 1 Size statistics for the main datasets

Size-balanced nodule dataset

A size-balanced subset of nodules (GA) was created from the full database to assess the impact of size on the classification result (see Table 2). First, all malignant and benign nodules were divided into bins based on their volumetric derived diameters (3, 4, 5 mm, etc). Then, bins smaller than 5 mm were discarded since these nodules were too small for the shape-related features to be effective. Bins larger than 14 mm were discarded due to the lack of data (usually less than three nodules per bin). For the remaining bins (5–14 mm), the same number of malignant and benign nodules was randomly selected to maximize the number of nodules in each bin. We explored two binning strategies: the first was to create three bins each with a similar size range and the second was to partition into just two bins (by combining the two largest size bins) so that each bin would have a similar number of nodules (see Table 3). For the first binning strategy, the first bin (G6) only includes nodules with a size from 5.0 to 7.0 mm; the second bin (G8) includes nodules with a size from 7.0 to 9.0 mm; the third bin (G12) includes nodules with a size greater than 9.0 mm. The three bins were designed so that each bin would have a sufficiently large number of nodules and the volume range within each bin would be similar (see Table 3 volume range). For the second binning strategy, the first bin contains G6 nodules and the second bin combines both G8 and G12 nodules.
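The binning rule can be sketched in Python; which bin owns the exact 7.0 and 9.0 mm boundaries is not specified in the text, so the edge handling below is our assumption:

```python
def assign_bin(diameter_mm: float):
    """Three-bin strategy: G6 (5.0-7.0 mm), G8 (7.0-9.0 mm), G12 (9.0-14.0 mm).
    Nodules outside the 5-14 mm range are excluded (returns None).
    Edge inclusivity is an illustrative choice, not stated in the text."""
    if diameter_mm < 5.0 or diameter_mm > 14.0:
        return None
    if diameter_mm <= 7.0:
        return "G6"
    if diameter_mm <= 9.0:
        return "G8"
    return "G12"

def assign_two_bin(diameter_mm: float):
    """Two-bin strategy: G6 stays; G8 and G12 merge into one large bin."""
    b = assign_bin(diameter_mm)
    return "G8+G12" if b in ("G8", "G12") else b
```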

Table 2 Size-balanced nodule size distributions
Table 3 Size-balanced nodule size distributions with binning

In total, 163 malignant and 163 benign nodules were selected to have as similar size distribution as possible. In the size-balanced dataset, 44.79 % (146/326) nodules had a size between 5.0 and 7.0 mm, 28.22 % (92/326) nodules had a size between 7.0 and 9.0 mm and 26.99 % (88/326) nodules had a size between 9.0 and 14.0 mm (Fig. 4).

Fig. 4
figure 4

Size-balanced subset nodule size distribution

Image features

In this work, 46 3D features [18] were computed from the segmented nodule images. These features are grouped into four categories: morphological, density, surface curvature and margin gradient (see Table 8 in “Appendix”). Images were resampled to 0.25 mm isotropic resolution for feature evaluation [18].
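Isotropic resampling can be sketched with plain NumPy; nearest-neighbour lookup is used here for brevity, and the paper's actual interpolation scheme may differ:

```python
import numpy as np

def resample_isotropic(volume, spacing_mm, target_mm=0.25):
    """Nearest-neighbour resampling of a CT volume to isotropic voxels.

    volume: 3D array indexed (z, y, x); spacing_mm: per-axis voxel spacing
    in mm.  The paper resamples to 0.25 mm isotropic before feature
    extraction; the interpolation scheme here is an illustrative choice."""
    out_shape = [max(1, int(round(n * s / target_mm)))
                 for n, s in zip(volume.shape, spacing_mm)]
    # for each output index, the nearest source index along that axis
    idx = [np.minimum((np.arange(m) * target_mm / s).astype(int), n - 1)
           for m, n, s in zip(out_shape, volume.shape, spacing_mm)]
    return volume[np.ix_(idx[0], idx[1], idx[2])]
```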

Morphological features describe the shape characteristics of the nodule and are derived from standard image moments [19]. Radiologists use the nodule shape as an indicator of malignancy; for example, Takashima et al. [20] identified a greater prevalence of polygonal shape and 3D ratios in benign nodules compared to malignancies. The morphological features are: volume, surface area, volume-to-surface area ratio, compactness, sphericity, attachment ratio, length/width/height of the ellipsoid of inertia, ratios of the length/width/height, the roll/pitch/yaw of the ellipsoid of inertia and the scale-normalized second-order morphological moment.
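Two of the listed shape features have standard closed forms in terms of volume and surface area; the sketch below uses the common definitions, which may differ in detail from the implementation in [18]:

```python
import math

def sphericity(volume: float, surface_area: float) -> float:
    """Standard sphericity: 1.0 for a perfect sphere, < 1.0 otherwise."""
    return math.pi ** (1.0 / 3.0) * (6.0 * volume) ** (2.0 / 3.0) / surface_area

def compactness(volume: float, surface_area: float) -> float:
    """Dimensionless surface-to-volume compactness; 1.0 for a sphere
    under this normalization (one common convention)."""
    return surface_area ** 3 / (36.0 * math.pi * volume ** 2)
```

For a unit-radius sphere (volume \(4\pi /3\), surface area \(4\pi \)), both measures evaluate to 1.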

Since the gray levels of a CT scan are representative of the density of the tissue, density features can be derived from the gray-level voxel values of the image. One of the density characteristics often used by radiologists is the average density of the nodule—whether the nodule is solid, part-solid or nonsolid has a significant effect on the interpretation of the nodule. The density features analyzed in this work are: density mass, mean density, the standard deviation, skewness and kurtosis of the density histogram, the length/width/height of the density-based ellipsoid of inertia, the ratios of length/width/height, and the scale-normalized second-order densitometric moment.
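The histogram-moment features can be sketched directly from the voxel densities (plain Python, standard moment definitions; the paper's exact formulations may differ, and a non-constant input is assumed):

```python
import math

def density_histogram_features(hu_values):
    """Mean, standard deviation, skewness and kurtosis of a nodule's voxel
    densities, via the usual population-moment formulas."""
    n = len(hu_values)
    mean = sum(hu_values) / n
    var = sum((v - mean) ** 2 for v in hu_values) / n
    std = math.sqrt(var)  # assumes the values are not all identical
    skew = sum((v - mean) ** 3 for v in hu_values) / (n * std ** 3)
    kurt = sum((v - mean) ** 4 for v in hu_values) / (n * std ** 4)
    return mean, std, skew, kurt
```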

The surface features of a nodule are often considered by radiologists in determining nodule malignancy status. These features are represented by the surface curvature features, which measure the rate of change of the surface normal across the surface. Although the surface curvature can be computed directly from the gray-level voxels [21], errors are introduced by the fact that the voxels are rectangular approximations of the nodule surface. To address this problem, the surface curvature is estimated from a smoothed polygonal tessellation of the segmented binary nodule image, as described by Jirapatnakul [22]. To generate the tessellation, the marching cubes algorithm developed by Lorensen and Cline [23] was used. This algorithm produces triangles oriented at angles that are multiples of \(45^{\circ }\); to improve the surface representation, the polygonal tessellation was smoothed by moving each vertex to a weighted sum of the neighboring vertices and itself. Once the smoothed polygonal representation is obtained, the surface normal of each triangle can be computed. From the triangle normals, the surface normal of each vertex is computed as the average of the normals of the triangles of which it is a member. Finally, the curvature for a triangle is computed as the average difference between the surface normals at its vertices. The mean, minimum, maximum, range, standard deviation, skewness and kurtosis of the curvature distribution were included as features.
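The normal-averaging scheme described above can be sketched as follows; this is a simplified version in which the smoothing step and the exact curvature formula of [22] are omitted, and every vertex is assumed to belong to at least one triangle:

```python
import numpy as np

def triangle_curvatures(vertices, faces):
    """Per-triangle curvature proxy: triangle normals -> vertex normals
    (average of incident triangle normals) -> mean pairwise difference of
    a triangle's vertex normals.  A flat mesh yields zero curvature."""
    v = np.asarray(vertices, float)
    f = np.asarray(faces, int)
    # unit normal of each triangle
    tn = np.cross(v[f[:, 1]] - v[f[:, 0]], v[f[:, 2]] - v[f[:, 0]])
    tn /= np.linalg.norm(tn, axis=1, keepdims=True)
    # vertex normals: average of the normals of incident triangles
    vn = np.zeros_like(v)
    counts = np.zeros(len(v))
    for i, face in enumerate(f):
        for vi in face:
            vn[vi] += tn[i]
            counts[vi] += 1
    vn /= counts[:, None]
    vn /= np.linalg.norm(vn, axis=1, keepdims=True)
    # curvature per triangle: mean difference between its vertex normals
    a, b, c = vn[f[:, 0]], vn[f[:, 1]], vn[f[:, 2]]
    return (np.linalg.norm(a - b, axis=1) + np.linalg.norm(b - c, axis=1)
            + np.linalg.norm(a - c, axis=1)) / 3.0
```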

The final category of features is the nodule margin. The nodule margin refers to the boundary between the nodule and the surrounding lung parenchyma. While the surface curvature features capture the shape of the nodule at the margin, the margin gradient features measure the density changes that occur at the margin. To compute the margin gradient, the surface normals for each triangle in the nodule surface representation are used; these normals are computed in the process of computing the surface curvature, as described in the previous paragraph. In addition, gradient images in each direction (x, y and z) are created from the resampled isotropic grayscale images using a 3D operator proposed by Monga et al. [24, 25]. At each triangle, ten gradient samples are taken along the surface normal vector through the center of the triangle, and the highest gradient value is recorded for the triangle. The mean, minimum, maximum, range, standard deviation, skewness and kurtosis of the distribution of gradients were included as features.
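The per-triangle gradient sampling can be sketched as below; the one-voxel sample spacing and the centering of the ten samples on the triangle are our assumptions, and nearest-neighbour lookup stands in for whatever interpolation the authors used:

```python
import numpy as np

def margin_gradient(grad_mag, center, normal, n_samples=10, step=1.0):
    """Maximum gradient magnitude along the surface normal through a
    triangle center.  grad_mag: 3D gradient-magnitude volume; center and
    normal are in voxel coordinates.  Samples are clipped to the volume."""
    normal = np.asarray(normal, float)
    normal /= np.linalg.norm(normal)
    best = -np.inf
    for k in range(n_samples):
        # samples straddle the triangle center along the normal direction
        p = np.asarray(center, float) + (k - n_samples / 2.0) * step * normal
        idx = tuple(np.clip(np.round(p).astype(int), 0,
                            np.array(grad_mag.shape) - 1))
        best = max(best, grad_mag[idx])
    return best
```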

Feature classification

Five different classifiers were evaluated: the distance-weighted k-nearest-neighbors classifier (dwNN) [26], the Support Vector Machine (SVM) classifier [27] with a polynomial kernel (SVM-P), the SVM with a Radial Basis Function kernel (SVM-R), the logistic regression classifier (LOG) and the size threshold classifier (Size-C). For the dwNN, SVM (polynomial and RBF) and LOG classifiers, a fivefold cross-validation approach was used for training and testing. In the training stage, the training set was further divided into training and validation subsets for parameter optimization, again using fivefold cross-validation. The final classification outcome was represented by the average ROC curve and the area under the ROC curve (AUC) obtained from the five ROC curves of the fivefold cross-validation. The threshold averaging method [28] was used for ROC averaging.
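The outer evaluation loop can be sketched as follows; this is a simplified stand-in for the paper's protocol (stratified folds, no inner tuning loop, and a rank-statistic AUC in place of threshold-averaged ROC curves):

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic,
    with ties given half credit."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks over tied scores
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def cv_mean_auc(X, y, fit_predict, k=5, seed=0):
    """Mean test-fold AUC over k stratified folds.  fit_predict(X_train,
    y_train, X_test) must return a score per test sample."""
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.flatnonzero(y == 1))
    neg = rng.permutation(np.flatnonzero(y == 0))
    folds = [np.concatenate(f) for f in
             zip(np.array_split(pos, k), np.array_split(neg, k))]
    aucs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        aucs.append(auc(y[test], fit_predict(X[train], y[train], X[test])))
    return float(np.mean(aucs))
```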

Compared to the conventional k-nearest-neighbors classifier, the dwNN classifier weights each neighbor n of a feature vector based on its distance \(d_{n}\). The weight \(w_{n}\) is computed as follows, where \(\sigma \) is a constant that controls the impact of each neighbor on the classification outcome. In the training stage, a grid search was performed to find the optimal \(\sigma \) (see Table 9 in “Appendix”).

$$\begin{aligned} w_n =\frac{1}{\exp \left( \sigma d_n \right) } \end{aligned}$$
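A sketch of the resulting decision rule; here every training sample votes, a simplification of the neighbor set, and `dwnn_score` is our name:

```python
import math

def dwnn_score(train_X, train_y, x, sigma=1.0):
    """Distance-weighted vote for feature vector x: training sample n
    contributes weight w_n = 1 / exp(sigma * d_n) = exp(-sigma * d_n).
    Returns the weighted fraction of malignant (label 1) votes."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    w = [math.exp(-sigma * dist(t, x)) for t in train_X]
    return sum(wi for wi, yi in zip(w, train_y) if yi == 1) / sum(w)
```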

The SVM classifier was implemented using the \(\hbox {SVM}^{\mathrm{light}}\) library [29]. For the SVM with polynomial kernel (SVM-P), the two parameters obtained from training were the order of the polynomial kernel d and the trade-off between training error and margin c. The search space for d and c is shown in Table 9 in “Appendix.” Joachims [29] stated that \(c=0.001\) is acceptable for most tasks and that a larger c leads to considerably longer training time. For the SVM with RBF kernel (SVM-R), the two parameters obtained from training were the RBF kernel parameter g and the trade-off between training error and margin c. The search space for g and c is also shown in Table 9 in “Appendix.”

For the LOG classifier, Peduzzi et al. [30] have shown in a simulation study that for each feature, LOG would require at least ten positive and ten negative samples to avoid bias. In the training stage, each feature was ranked based on its individual AUC and the top n features were selected. The search space for n is shown in Table 9 in “Appendix.”
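Both the Peduzzi-style feature budget and the AUC-based ranking are easy to sketch (the function and feature names are illustrative):

```python
def max_log_features(n_pos: int, n_neg: int) -> int:
    """Peduzzi-style rule of thumb: at least ten positive and ten negative
    samples per feature in a logistic regression model."""
    return min(n_pos, n_neg) // 10

def top_n_features(feature_aucs: dict, n: int) -> list:
    """Rank candidate features by their individual AUC; keep the best n."""
    return sorted(feature_aucs, key=feature_aucs.get, reverse=True)[:n]
```

With the 163 malignant and 163 benign nodules of the balanced dataset, for example, this budget would allow up to 16 features.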

In addition, results for the size-only classification scheme were computed. For a given size threshold T, the size classifier labels all nodules with a size greater than T as malignant and all nodules with a size less than T as benign. The evaluation metric was the AUC, obtained by varying the size threshold T through the size range of the nodules in the dataset; therefore, no training was required for this classification method. The size classifier provides information on the imbalance between the malignant and benign size distributions of the test dataset: the greater the size imbalance, the higher the AUC.
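Because sweeping T traces the entire ROC curve, the size classifier's AUC equals the probability that a randomly chosen malignant nodule is larger than a randomly chosen benign one; a direct sketch:

```python
def size_classifier_auc(sizes, labels):
    """AUC of the size-threshold classifier, computed as the fraction of
    (malignant, benign) pairs in which the malignant nodule is larger,
    with ties counted as one half."""
    wins, total = 0.0, 0
    for s_m, y_m in zip(sizes, labels):
        if y_m != 1:
            continue
        for s_b, y_b in zip(sizes, labels):
            if y_b != 0:
                continue
            total += 1
            if s_m > s_b:
                wins += 1.0
            elif s_m == s_b:
                wins += 0.5
    return wins / total
```

A perfectly size-balanced test set drives this toward 0.5; a set whose malignant nodules are systematically larger drives it toward 1.0.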

Experiments

Two main experiments were performed: the first to evaluate the impact of class size distribution imbalance by comparing the size-only classifier to methods using additional image features, and the second to evaluate the impact of using size-balanced datasets. In all experiments, the full set of image features and all five classifier types were considered. The organization of the experiments is illustrated in Fig. 5. The main dataset All Data consists of the two trial cohorts. A size distribution-balanced dataset is selected from All Data for the second experiment, and it is further partitioned into size bins for the binning classifiers.

Fig. 5
figure 5

Overview of the experiment organization

In the first experiment, the traditional method for training and testing using cross-validation on All Data was evaluated. In addition, to illustrate the impact of the size imbalance, the classifiers trained on just one of the two data cohorts are also evaluated on All Data.

In the second experiment, performance on the size-balanced dataset (GA, see Table 3) is compared under three different training strategies. First, the performance on this dataset using the unbalanced full-data-trained classifier from the first experiment was measured. Second, the performance of the balanced data (GA) trained (through cross-validation) on itself was evaluated. Third, a binned training strategy was evaluated in which the balanced dataset was partitioned into different size groups and a classifier was trained for each group using only nodules in the same size group. A two-bin grouping using bins G6 and G8 + G12 and a three-bin strategy using bins G6, G8 and G12 were evaluated.

The usual metric for classifier performance is the area under the ROC curve (AUC). Since, in this context, much of this performance is attributable to the difference in the test set size distributions, an additional metric, the incremental increase in AUC compared to the size classifier (IAUC), was considered to be more relevant. The DeLong test [31] was used to compare pairs of ROC curves. It estimates a covariance matrix from two ROC curves, which may also be used to construct confidence regions and to compute the statistical significance of the difference between the two AUCs.
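The IAUC itself is simply a difference of AUCs computed on the same test set; for example, the full-data figures reported in "Results" (0.772 for the best image-feature classifier versus 0.725 for the size classifier) give an IAUC of 0.047:

```python
def iauc(feature_auc: float, size_auc: float) -> float:
    """Incremental AUC: the gain of a feature-based classifier over the
    size-only classifier evaluated on the same test set."""
    return feature_auc - size_auc
```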

Results

In the following tables of AUC results, the mean AUC value for the fivefold cross-validation is reported, together with the standard deviation in parentheses. The p value of the DeLong test with respect to the size classifier is also given.

Results for the size-unbalanced dataset

The result for the full unbalanced dataset is shown in Table 4. A comparison of the different training datasets using an SVM-P classifier is shown in Fig. 6, and a comparison of the different classifiers for the full unbalanced dataset is given in Fig. 7. For LOG on the full unbalanced data (Table 4, “all” row), the optimal set of features is listed in Table 10 in “Appendix.” For the full unbalanced data, each classifier’s ROC was compared to the size classifier’s ROC, and the p values are listed in Table 4. Values listed in bold in Table 4 indicate that the ROC appears in either Fig. 6 or Fig. 7.

Fig. 6
figure 6

ROC curve for SVM with Polynomial kernel on the two training dataset separately (red for Weill Cornell and green for NLST) and combined (blue). The size classifiers on Weill Cornell (black), NLST (brown) and combined dataset (magenta) are also shown

Fig. 7
figure 7

ROC curve for dwNN (red), SVM with polynomial kernel (green), SVM with RBF kernel (blue), logistic regression (black) and size classifier (magenta) on the full unbalanced dataset

Table 4 Classifier performance (AUC) for the unbalanced data sets

Results for the size-balanced dataset

The results for the balanced dataset with the unbalanced training and the balanced training schemes are shown in Table 5. Each classifier was also compared to the size classifier, and the p value is given. Table 6 shows the results for the two binning strategies under different training conditions: unbalanced binned training and balanced binned training. For each training condition, only the result from the best classifier is shown. The two binning strategies were three bins (G6, G8 and G12) and two bins in which the larger bins G8 and G12 were combined into one large bin (G8 + G12) so that the small bin (G6) and the large bin contained a similar number of nodules. Table 7 shows the overall performance for the binned classifiers (three bins and two bins) under different training conditions.

Table 5 Classifier performance (AUC) for the balanced dataset GA trained on balanced and unbalanced data
Table 6 Best performance AUC for different evaluation data sets: G6, G8, G12 and Large (G8 + G12)
Table 7 Best performance AUC for binned classifiers (3-bin and 2-bin). Size-C AUC = 0.510

The ROC curves for the full balanced dataset with each classifier with balanced training (GA balanced) are shown in Fig. 8. The ROC curves for GA with each classifier with unbalanced training (GA unbalanced) are shown in Fig. 9. Figure 10 shows the ROC curves for the best classifier under each training condition: unbalanced training, balanced training, overall performance using three bins (G6, G8 and G12) and overall performance using two bins (small and large). For testing on GA set using balanced and unbalanced training, the optimal features for LOG are listed in Table 11 in “Appendix.”

Fig. 8
figure 8

Comparison of classifiers on the size-balanced dataset GA using balanced training

Fig. 9
figure 9

Comparison of classifiers on the size-balanced dataset GA using unbalanced training

Fig. 10
figure 10

Comparison of the best classifier under different training conditions when tested on the size-balance dataset GA

Discussion

Due to the data selection methods for size balancing and the image quality requirements, neither the Weill Cornell nor the NLST size distribution accurately reflects the size distributions of the subjects in lung cancer screening studies; however, the general distribution for the cancers is representative, as we selected all usable malignant nodule images. This is not the case for the benign nodules, since these were selected with a view to size balancing; in the full studies, there are many more small benign nodules.

Pulmonary nodule classification from screening CT images acquired for nodule detection is a very challenging task given the small size of the nodules and the large amount of image noise. From Table 4 and Figs. 6 and 7, we see that the size classifier, which is only sensitive to the difference in the size distributions of benign and malignant nodules, provides an AUC of 0.725 for the combined dataset. This number would have been much higher (and comparable to other published studies) if we had included the very large number of small benign nodules that were documented in the full screening studies. The size classifier ROC curves in Fig. 6 for the two individual study datasets show very similar properties, with a slightly larger size imbalance for the NLST dataset. Note that, in Fig. 6, the best evaluation results are superior to, but most closely follow, the size evaluation curve of the All Data test set even when the classifier is trained on only a single cohort.

In Fig. 7, we see a comparison of the different classification methods used for the combined full unbalanced dataset. Very little difference is noted; the best classifiers (SVM-P and SVM-R) have an IAUC of only 0.047 over the size classifier. The average improvement is 0.039. From Table 4, we see that the IAUC is only statistically significant for the two SVM classifiers \((p<0.05)\).

The results for the full size-balanced dataset are shown in Tables 5, 6, 7 and Figs. 8, 9, 10. The size classifier has an AUC of 0.510; for a perfectly balanced dataset, the value would be 0.500. For the balanced test set GA, the unbalanced classifier (which claimed an AUC of 0.772 when evaluated on all nodules) achieved an AUC of only 0.642 (IAUC of 0.132), compared to an AUC of 0.708 (IAUC of 0.198) for the balanced classifier. This difference in performance was statistically significant \((p=0.01)\).

In Tables 6 and 7 and Fig. 10, the AUC results for binning are shown. While all AUCs were statistically significant with respect to the size classifier, none of the balanced binned classifiers was statistically significantly different from the corresponding unbalanced binned classifier. However, the binned results show better overall AUC values than balanced training: 0.742 (two bins) and 0.726 (three bins) versus 0.708 (balanced training). Further, the small-nodule bin (G6) shows a lower AUC than the others under all training conditions: for the binned training, the IAUC for G6 was 0.145, while for the other bins it was much higher (G8, 0.259; G12, 0.252; G8 + G12, 0.277). This implies that the image features are less effective for these small nodules. An improvement in performance of the 2-bin classifier is noted (0.742 vs. 0.708), although this is not statistically significant \((p=0.35)\).

Figures 11 through 13 provide examples of some of the image issues and demonstrate the range of presentations shared by malignant and benign nodules. Figure 11 shows the impact of the image reconstruction filter on image quality, Fig. 12 shows malignant and benign nodules with similar complex presentations, and Fig. 13 shows the impact of structured image noise.

Fig. 11

The effect of the CT image reconstruction filter

Fig. 12

Examples of nodules with similar complex presentations

Fig. 13

Examples of images in which the nodule's intensity is impacted by structured scanner noise

There have been a number of studies reported in the literature on characterizing the malignancy status of pulmonary nodules from CT images [2–11]; performances have been reported in terms of area under the curve (AUC) for ROC analysis in the range of 0.79–0.92. Of these studies, only three have used nodules from a lung cancer screening study [2, 5, 8]. These three studies all used the same dataset, which has over 400 benign nodules and fewer than 80 malignant nodules and is dominated by a large number of very small benign nodules (smaller than any malignant nodule), a major factor in determining the AUC performance [12].

Limitations of this study

For this retrospective study, the data are on the order of 10 years old and do not reflect the impact of recent changes in CT technology. Many of the scans (65 %), especially those from the NLST (94 %), had a slice thickness of 2 mm or greater, which is half the image resolution specified by current lung cancer screening guidelines. We did not consider nonsolid nodules, since these are a different phenotype and are not represented in our database in sufficient numbers; these nodules should be the subject of a future study.

The scans for this study span a wide range of CT models and parameter settings. No image preprocessing was performed to compensate for different image scanner parameters, especially with respect to image reconstruction filtering and image noise (see Figs. 11, 13); however, image resampling to isotropic space was performed for feature evaluation.

Conclusion

The task of nodule classification in the context of lung cancer screening has the following distinguishing characteristics: (a) low image resolution, (b) high image noise, (c) a tremendous size range of nodules, (d) different size distributions for benign and malignant nodules, and (e) a large variation in CT scanner acquisition parameters. For a classification system to be relevant to lung cancer screening, all of these issues need to be considered. Ignoring size issues may result in overly optimistic performance results that reflect only the imbalance in the test set size distribution. This imbalance causes the system to confound the population-based difference in size distribution with the patient-specific image features of the nodule. The predictive power associated with the non-size-impacted image features may be determined by using a size-balanced dataset.

In this study, we have explored the size issues using a large size-enriched dataset of 736 nodules, created by combining images from the two largest lung cancer screening studies. Our results indicate that there is a measurable improvement in the prediction of malignancy when image features are used in addition to size alone; however, the main predictor is size, and this must be carefully accounted for when attributing benefit to other image features. The overoptimistic performance and biased learning caused by class size distribution differences can be avoided by using a size-balanced evaluation dataset. The tremendous size range of pulmonary nodules encountered in screening may be addressed by binning, that is, training a set of classifiers, each on a small nodule size range, and selecting the size-specific classifier for a given case. In any case, appropriate representation of the large size range will require much larger datasets than the 736 cases used in this study.

In this study, the incremental improvement of the AUC over size was only 0.047 for the full unbalanced dataset. The balanced data test set had a statistically significant improved performance \((p=0.01)\) with the IAUC increasing from 0.132 to 0.198 when trained on the balanced data, which was further increased to 0.232 by using two bins. This provides a modest improvement over size information alone.

The population-based probability of malignancy based on size is a major prediction factor that is known a priori from the analysis of cancer screening studies and practice. The essential issue for a patient-based nodule characterization system is to determine the probability of malignancy conditioned on size, rather than the joint probability of malignancy and size.
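The distinction can be illustrated with invented counts (not the study's data): a pooled estimate of the malignancy probability is dominated by the many small benign nodules and so mostly restates the population size imbalance, while conditioning on size gives the patient-relevant number.

```python
# Invented counts per volume bin: (benign, malignant). In screening data,
# the small bin is benign-dominated and the large bin malignant-enriched.
counts = {"small": (400, 10), "medium": (80, 30), "large": (20, 40)}

# Pooled estimate of P(malignant): dominated by the many small benign
# nodules, so it mostly reflects the population-level size imbalance.
total_benign = sum(b for b, m in counts.values())
total_malignant = sum(m for b, m in counts.values())
p_pooled = total_malignant / (total_benign + total_malignant)

# Conditional estimate P(malignant | size bin): the patient-relevant number.
p_cond = {name: m / (b + m) for name, (b, m) in counts.items()}
print(p_pooled, p_cond)
```

With these counts the pooled probability is about 0.14, while the conditional probability ranges from about 0.02 in the small bin to about 0.67 in the large bin, showing how much information the pooled figure discards.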

There are several technical improvements that may lead to improved classification performance, including higher-resolution images, standardization of scanner parameters, and reduction of image noise.