Introduction

Solitary pulmonary nodules (SPN), usually discovered incidentally on a chest radiograph or chest computed tomography (CT) scan performed for any reason, are the most common manifestation of lung cancer [1]. Detecting and diagnosing SPN is critical, as early identification of malignant nodules is crucial to the chance of successful treatment. Solitary pulmonary nodules, however, are usually small, located deep in the pulmonary parenchyma, and often yield atypical imaging findings. While researchers are seeking advanced imaging techniques [2–6], more and more clinicians have stressed that, in addition to radiographic and imaging characteristics such as size, edge, or contour, clinical information such as age and smoking history should also be factored into the decision of whether a nodule is benign or malignant. Faced with this volume of data, clinical experience and judgment may not be reproducible or reliable, whereas a quantitative model might have advantages in accuracy and reproducibility, would not be influenced by personal judgment, and would produce results that can be exchanged and compared between clinicians.

With these advantages in mind, researchers began to suggest that a clinical prediction equation has the potential to facilitate clinical decision making [7–12]. One widely cited prediction model was proposed by Swensen et al. [8, 9] at the Mayo Clinic, who reported that the ability of a mathematical model to judge the character of SPN is similar to the clinical judgment of experienced physicians. Nevertheless, one limitation of their model is that 12% of the patients did not have a final diagnosis. More recently, Gould et al. [10], in a Department of Veterans Affairs (VA) Cooperative Study, reported excellent agreement between the predicted probability and the observed frequency of malignant SPN. The area under their receiver operator characteristic (ROC) curve was 0.79 (95% confidence interval [95% CI], 0.74–0.84). However, their model requires external validation in an independent cohort of patients with SPN, it includes relatively few clinical predictors, and the authors were unable to evaluate spiculation or a remote history of extrathoracic cancer.

To improve the accuracy of such models, systematic and comprehensive collection of clinical and imaging data with specific diagnoses is needed. Mery et al. [13] reported that age, tumor size, smoking history, and time of smoking cessation were independent factors, whereas Zhang [14] reported that patient age, tumor size, and CT imaging were independent factors. The differences in the independent factors emphasize the significance of the source of the data (i.e., foreign versus domestic clinical data). In the present study, we aimed to develop a prediction model for patients with SPN based on comprehensive data collection and thorough analysis.

Materials and methods

Clinical data

Between January 2000 and September 2009, medical records from 405 patients with a radiographic diagnosis of SPN were reviewed. Of these, 9 were excluded because data were incomplete and 25 were excluded because of a history of pulmonary or extrapulmonary malignancy within the previous 5 years. In total, 371 cases were enrolled to create a mathematical model (group A). The clinical data collected included patient age, gender, smoking history, history of cancer, family history of cancer, calcification, spiculation, lobulation, pleural retraction, clear border, cavity, vascular convergence, tumor site (upper lobe or lower lobe, left or right), and tumor diameter.

Clinical data were also collected from an additional 150 patients with a radiographic diagnosis of SPN between October 2009 and May 2011. Of these, 5 cases were excluded because of a history of pulmonary or extrapulmonary malignancy within the previous 5 years. In total, 145 cases were enrolled to test the constructed mathematical model (group B).

Surgical procedure

All patients included in groups A and B underwent surgical resection of pulmonary nodules, after which a definitive pathologic diagnosis of an SPN as benign or malignant was established. Surgical procedures included tumor enucleation, pulmonary wedge resection, and lobectomy.

Statistical analysis

SPSS 13.0 software (2004, IBM, Armonk, NY) was used for statistical analysis. In the univariate analysis, the data for group A were analyzed to identify all factors affecting the probability of malignancy for SPN. Multivariate logistic regression was then performed to select independent predictive factors. A mathematical model for SPN was subsequently devised based on the results of the multivariate logistic regression. Receiver operator characteristic curves were created and the areas under the curves were calculated. Appropriate cut-off points were determined and the sensitivity, specificity, positive predictive value, and negative predictive value were calculated. A p value < 0.05 was considered statistically significant.
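As an illustration of the ROC steps named above, the following minimal Python sketch computes an area under the curve and picks a cut-off. It is not the SPSS procedure used in the study: the function names and the toy data are invented for illustration, and choosing the cut-off by maximizing Youden's index (sensitivity + specificity − 1) is an assumption about what "appropriate cut-off point" could mean.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (malignant, benign) pairs in which the
    malignant case has the higher score; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's index.
    A case is called malignant when its score is strictly above the cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= c)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1], best[2], best[3]


# Toy data, invented for illustration only (1 = malignant, 0 = benign).
scores = [0.10, 0.30, 0.35, 0.50, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

auc = roc_auc(scores, labels)
cutoff, sens, spec = best_cutoff(scores, labels)
print(f"AUC = {auc:.3f}, cutoff = {cutoff}, "
      f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

The pairwise formulation is quadratic in the number of cases, which is acceptable for cohorts of a few hundred patients; a rank-based implementation would be preferable for larger data sets.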

Results

Pathology revealed 142 cases of benign disease (38.3%) and 229 cases of malignant disease (61.7%) in group A, and 47 cases of benign disease (32.4%) and 98 cases of malignant disease (67.6%) in group B. There was no statistically significant difference in age, sex, or nodule diameter between the two groups (p > 0.05).

Univariate and multivariate analyses

The univariate analysis results are shown in Table 1. There were significant differences in age, smoking history and smoking, family history of cancer, calcification, spiculation, lobulation, pleural retraction sign, clear border, and maximum tumor diameter between the benign and malignant SPN patients. Of these, age, family history of cancer, spiculation, calcification, clear border, and maximum tumor diameter were identified by multivariate logistic regression analysis as independent factors distinguishing malignant from benign SPN (Table 2).

Table 1 Univariate analysis of data collected from patients included in group A
Table 2 Multivariate logistic regression analysis

Model construction

The following formula was employed to describe the malignant probability:

$$ \begin{gathered} p = e^{x} / \left( 1 + e^{x} \right), \\ x = -4.496 + \left( 0.07 \times \text{Age} \right) + \left( 0.676 \times \text{diameter} \right) + \left( 0.736 \times \text{spiculation} \right) \\ + \left( 1.267 \times \text{family history of cancer} \right) - \left( 1.615 \times \text{calcification} \right) - \left( 1.408 \times \text{border} \right), \end{gathered} $$

where e is the base of the natural logarithm, and each of the four binary variables (family history of cancer, calcification, spiculation, and clear border) equals 1 if the feature is present and 0 otherwise.
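For readers who want to evaluate the formula numerically, a minimal Python sketch follows. The coefficients are copied from the equation above; the function name, argument names, and the example patient are hypothetical. The unit of the diameter is not stated in this section, so the sketch simply passes the value through and assumes the caller supplies it in the unit used to fit the model.

```python
import math


def spn_malignancy_probability(age, diameter, spiculation,
                               family_history, calcification, clear_border):
    """Return p = e^x / (1 + e^x) for the model given in the text.

    The last four arguments are binary indicators (1 = present, 0 = absent).
    `diameter` must be in the unit used to fit the model, which this
    section does not state (assumption left to the caller).
    """
    x = (-4.496
         + 0.07 * age
         + 0.676 * diameter
         + 0.736 * spiculation
         + 1.267 * family_history
         - 1.615 * calcification
         - 1.408 * clear_border)
    # 1 / (1 + e^-x) is algebraically equal to e^x / (1 + e^x).
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical patient, invented for illustration: age 60, diameter 2.0
# (in the model's unit), spiculated nodule, no other features present.
p = spn_malignancy_probability(60, 2.0, 1, 0, 0, 0)
print(f"predicted probability of malignancy: {p:.3f}")
```

For this hypothetical input x = 1.792, which gives a predicted probability of roughly 0.86. Setting the calcification or clear-border indicator to 1 lowers the result, in line with the negative coefficients.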

The clinical data of the 371 cases included in group A were entered into the mathematical model to calculate a predicted probability of malignancy (p) for each patient. A probability of 0.463 was ultimately selected as the cut-off point: nodules with p > 0.463 were classified as malignant and those with p < 0.463 as benign.
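The decision rule can be written as a short Python sketch. The cut-off value is taken from the text; the function name is hypothetical, and because the text does not say how a probability exactly equal to 0.463 is handled, the sketch assumes it is classified as benign.

```python
CUTOFF = 0.463  # cut-off point reported in the text


def classify_spn(p, cutoff=CUTOFF):
    """Label a predicted probability as 'malignant' or 'benign'.

    Assumption: p exactly equal to the cutoff is labeled benign,
    since the text only covers p > cutoff and p < cutoff.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return "malignant" if p > cutoff else "benign"


# Probabilities below are invented for illustration.
for prob in (0.20, 0.463, 0.86):
    print(prob, classify_spn(prob))
```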

Model validation

Clinical data of the patients in group B were then used to test the accuracy of the model by comparing the calculated result (predicted probability p) with the pathology results; ROC curves were then created (Fig. 1). The area under the ROC curve was 0.874 ± 0.028. The sensitivity of this model for group B was 94.5%, the specificity was 70.0%, the positive predictive value was 87.8%, and the negative predictive value was 84.8%.
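The four measures reported above all derive from a 2 × 2 table of predicted class against pathology. The Python sketch below shows the standard definitions; the function name is hypothetical and the counts are invented for illustration, since the study's own 2 × 2 table is not given in this section.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard accuracy measures from a 2 x 2 table.

    tp: malignant on pathology, predicted malignant
    fp: benign on pathology, predicted malignant
    tn: benign on pathology, predicted benign
    fn: malignant on pathology, predicted benign
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of malignant nodules detected
        "specificity": tn / (tn + fp),  # share of benign nodules cleared
        "ppv": tp / (tp + fp),          # predicted malignant that are malignant
        "npv": tn / (tn + fn),          # predicted benign that are benign
    }


# Hypothetical counts, NOT the study's data.
metrics = diagnostic_metrics(tp=45, fp=10, tn=40, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Unlike sensitivity and specificity, the predictive values depend on the proportion of malignant cases in the cohort, so they would not transfer directly to a population with a different prevalence.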

Fig. 1 Receiver operator characteristic (ROC) curve generated using our proposed model

In addition, two established foreign mathematical models, the Mayo model [4] and the VA model [5], were tested using the group B data.

The Mayo model

Independent factors in the Mayo model were age, smoking history, cancer history, diameter, spiculation, and location in the upper lobe. The calculation was based on the formula

$$ p = e^{x} /\left( {1 \, + e^{x} } \right) $$

where x = −6.8272 + (0.0391 × age) + (0.7917 × smoking history) + (1.3388 × cancer history) + (0.1274 × diameter) + (1.0407 × spiculation) + (0.7838 × the upper lobe). The area under the ROC curve of group B was 0.784 ± 0.038 (Fig. 2).
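A minimal Python sketch of the Mayo formula as quoted above follows. The coefficients are copied from the text; the function name, argument names, and example patient are hypothetical. Units are not stated in this section: the sketch assumes age in years and diameter in millimetres, which should be checked against the original publication before any use.

```python
import math


def mayo_probability(age, smoking_history, cancer_history,
                     diameter, spiculation, upper_lobe):
    """Return p = e^x / (1 + e^x) using the Mayo coefficients quoted in the text.

    Assumptions (not stated in this section): age in years, diameter in
    millimetres. The other arguments are binary indicators (1 = yes, 0 = no).
    """
    x = (-6.8272
         + 0.0391 * age
         + 0.7917 * smoking_history
         + 1.3388 * cancer_history
         + 0.1274 * diameter
         + 1.0407 * spiculation
         + 0.7838 * upper_lobe)
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical patient, invented for illustration: 60-year-old smoker,
# no cancer history, 20 mm spiculated nodule in an upper lobe.
p_mayo = mayo_probability(60, 1, 0, 20, 1, 1)
print(f"Mayo model probability: {p_mayo:.3f}")
```

For this hypothetical input x = 0.683, corresponding to a probability of roughly 0.66.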

Fig. 2 Receiver operator characteristic curve generated using the Mayo model

The VA group model

For the VA model, independent factors were age, smoking, tumor diameter, and time of smoking cessation. The equation was

$$ p = e^{x} /\left( {1 \, + e^{x} } \right) $$

where x = −8.404 + (2.061 × smoking history) + (0.779 × age) + (0.112 × diameter) − (0.567 × time of smoking cessation). The area under the ROC curve of group B was 0.754 ± 0.040 (Fig. 3).
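The VA formula can be sketched in Python in the same way. The coefficients are copied from the text; the function and argument names are hypothetical. Units are not stated in this section. The sketch assumes that age and time since smoking cessation enter the equation in decades and diameter in millimetres, because a coefficient of 0.779 per year of age would push almost every adult's probability to 1. This scaling is an assumption and should be verified against the original VA publication.

```python
import math


def va_probability(smoking_history, age_years, diameter_mm, years_since_quitting):
    """Return p = e^x / (1 + e^x) using the VA coefficients quoted in the text.

    Assumptions (not stated in this section): age and time since smoking
    cessation are converted to decades, diameter is in millimetres, and
    smoking_history is a binary indicator (1 = ever smoked, 0 = never).
    """
    age_decades = age_years / 10.0
    quit_decades = years_since_quitting / 10.0
    x = (-8.404
         + 2.061 * smoking_history
         + 0.779 * age_decades
         + 0.112 * diameter_mm
         - 0.567 * quit_decades)
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical patient, invented for illustration: 60-year-old current
# smoker (0 years since quitting) with a 20 mm nodule.
p_va = va_probability(1, 60, 20, 0)
print(f"VA model probability: {p_va:.3f}")
```

Under these assumptions the hypothetical patient has x = 0.571, a probability of roughly 0.64; a longer time since cessation lowers the estimate because of the negative coefficient.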

Fig. 3 Receiver operator characteristic curve generated using the VA model. Diagonal segments are produced by ties

The area under the ROC curve of the mathematical model created in the present study was significantly higher than those of the two foreign prediction models (p < 0.05), as shown in Table 3 and Fig. 4.

Table 3 Comparison of the receiver operator characteristic (ROC) curves of three mathematical models analyzed in this study
Fig. 4 Comparison of ROC curves generated using the three models analyzed in this study. The area under the ROC curve of our new model is significantly higher than that of the Mayo and VA models. Diagonal segments are produced by ties

Discussion

Mathematical models for evaluating solitary pulmonary nodules are of major interest to clinicians. The conventional approach is to establish a prediction model through univariate and then multivariate regression analysis. Until now, such models have all been derived retrospectively. The present work is also a retrospective study; unlike previous work, however, it combines comprehensive collection of both clinical and imaging information with a definitive pathologic diagnosis for each patient. This has not been done in any of the studies reported to date, and it cannot be achieved by a prospective study: because all the patients were diagnosed pathologically after surgical treatment, only a retrospective design can provide a confirmed diagnosis for every case. Multivariate regression analysis is widely used for this kind of modeling, and we followed this classic method without modification. We therefore attribute the higher accuracy of our model to comprehensive and systematic data collection, and especially to the pathologic diagnosis available for each case. The ROC comparison with the widely accepted models suggests that the more comprehensive the data collected, the more reliable the model will be.

Among the independent factors determined by multivariate analysis, age [8], nodule diameter [12–15], calcification [3–6], spiculation [7], and border [16] have been reported before, whereas family history of cancer is a new independent factor that has not previously been studied. Although malignant tumors have not been shown to have any specific genetic characteristic, several studies published in the past decade [17–21] have indicated that tumorigenesis may be the combined result of genetic and environmental factors; i.e., cancer arises from a genetic predisposition and is triggered by environmental influences. This topic has attracted considerable attention in recent years. We believe that genetic susceptibility or a hereditary predisposition to cancer could significantly increase the probability of malignancy in patients. This gives us another factor to consider when evaluating SPN.

In addition, it has been reported [3] that smoking and nodule location are independent factors for malignancy of SPN. In the present study, however, neither was found to be an independent factor. For smoking, one possible explanation is that there were more patients with adenocarcinoma in the present study; the relationship between smoking and adenocarcinoma is not as clear as that between smoking and squamous cell carcinoma. Alternatively, selection may have played a role: all of the patients in this study underwent a surgical procedure to obtain a pathologic diagnosis, so some non-smoking patients whose SPN were judged to be benign did not undergo operation and were not included. The proportion of non-smoking patients with benign nodules in our cohort decreased correspondingly, so whether a history of smoking is significantly associated with malignancy in patients with SPN requires further study. As for nodule location, there was a high number of patients in this cohort with tuberculosis, which is known to occur in the upper lobe, and the incidence of tuberculosis in China is higher than that in Western countries. This study suggests that there is no significant correlation between nodule location and the probability of malignancy of SPN.

The model fitted well to the independent cohort of 145 patients seen from October 2009 to May 2011, which demonstrates the accuracy of the prediction model. In addition, we compared our model with two previously published models [9, 10]. For the three models tested, the areas under the ROC curve were 0.874, 0.784, and 0.754, respectively, and the differences among the three models were significant. The larger the area under the curve, the more accurate the mathematical model. On this basis, the new model described herein performed significantly better than the existing formulae. Moreover, the cut-off value for the model was obtained by calculating the optimum sensitivity and specificity: predicted probabilities > 0.463 should be considered malignant disease and those < 0.463 benign. The sensitivity of this model was 94.5%, the specificity was 70.0%, the positive predictive value was 87.8%, and the negative predictive value was 84.8%.

In conclusion, patient age, diameter, border, calcification, spiculation, and family history of cancer were independent predictors of malignancy in patients with SPN. The devised prediction model was more accurate than two previously described models and was able to predict malignancy in patients with SPN. Although mathematical models provide an objective basis for judging the character of SPN, they remain clinical tools and cannot be used as a substitute for a pathologic diagnosis. Previous reports have indicated that the possibility of malignancy in patients with SPN is very high (80% or more) [22–24]. As patient age increases, the possibility of malignancy also increases significantly. Therefore, clinicians need to consider all SPN seriously, especially in patients with the six risk factors noted above.