Introduction

Lung cancer is one of the most common malignant tumours endangering human health. In China and worldwide, its incidence and mortality rank among the highest of all cancers, and both are increasing annually [1]. Lung cancer is a highly heterogeneous malignant epithelial tumour with distinct pathological features and clinical behaviour [2]. According to the histological appearance of the cancer cells under a microscope, lung cancer can be divided into small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC). The latter, which accounts for approximately 85% of lung cancer cases, includes adenocarcinoma (ADC, ~ 50%), squamous cell carcinoma (SCC, ~ 40%) and large cell lung cancer [3, 4]. Accurate staging and pathological grading of lung cancer are essential for determining a rational treatment regimen. For instance, Scagliotti et al. found that, compared with docetaxel, pemetrexed significantly prolonged the overall survival and progression-free survival of patients with ADC but had the opposite effect in SCC [5]. Hence, it is important to accurately distinguish between the two subtypes of NSCLC prior to initiating treatment.

Although experienced physicians can often diagnose the type of lung cancer based on clinical presentation and radiographic appearance, NSCLC is sometimes poorly differentiated and can be distinguished only by immunohistochemical staining and molecular testing. In the clinic, SCC is distinguished from ADC by immunostaining for cytokeratin 5 and cytokeratin 6 and/or the transcription factors SRY-box 2 (SOX2) and p63 [3, 4, 6]. CT-guided transthoracic needle aspiration is typically the first-line method for peripheral lesions, and obtaining an adequate tissue sample is imperative for optimizing the diagnosis and treatment plan [7]. For small or peripherally located lung cancers, current needle biopsy procedures sample only a small amount of tissue and have low accuracy, and in some cases CT-guided needle biopsy cannot be performed or is not suitable. For deeply located lesions or lesions close to airways or blood vessels, needle biopsy is challenging, and in patients with such unfavourable situations biopsy is not recommended. Moreover, tumours are often heterogeneous, which may affect the biopsy results. A non-invasive method for pathological classification prior to biopsy or surgery has not yet been developed.

Positron emission tomography/computed tomography (PET/CT) with the tracer 18F-fluorodeoxyglucose (18F-FDG) is an essential imaging modality for lung cancer [8, 9], and the majority of patients undergo 18F-FDG PET/CT before treatment initiation. However, radiologists have difficulty distinguishing ADC from SCC on PET/CT images, and interobserver agreement is usually low. Recent studies have indicated that radiomic features derived from PET/CT images can provide additional useful information that reflects underlying biological heterogeneity [10,11,12]. There is increasing interest in radiomics, which converts medical images into mineable high-dimensional quantitative data; the use of these data to predict treatment responses and patient outcomes has been reported across a range of primary tumours [13, 14]. Combined with machine learning techniques, radiomic features extracted from FDG PET/CT performed better than conventional staging parameters in predicting progression-free survival (PFS) in anal squamous cell carcinoma (ASCC) [15] and also performed well in identifying bone marrow involvement (BMI) in patients with suspected relapsed acute leukaemia (AL) [16]. To the best of our knowledge, only one published study has evaluated FDG PET/CT radiomics for discriminating ADC from SCC in NSCLC. That study applied a linear discriminant analysis (LDA) classifier and three feature selection algorithms [17] and reported a discriminative performance of 0.90 for LDA; however, as the authors stated, it was a preliminary study that enrolled only 30 patients, and the feature extraction, feature selection and classification methods were simple. A larger-scale, comprehensive study is necessary to explore the value of PET/CT imaging for discriminating NSCLC subtypes.

In this study, we investigated 10 feature selection methods, 10 machine learning (ML) classifiers and a deep learning (DL) algorithm (VGG16) for differentiating the histological subtypes of NSCLC [18]. These methods were chosen because of their popularity in the literature. The aim of this study was to evaluate whether FDG PET/CT images are useful for differentiating ADC from SCC and to identify the optimal model among numerous radiomics-based ML approaches and the DL algorithm. Such a model could provide a promising diagnostic tool for informing treatment decisions and fostering personalized therapy for patients with lung cancer.

Materials and methods

Patients

This retrospective, single-centre study included two cohorts of patients who underwent pulmonary PET/CT examination between January 2018 and August 2019 at the Department of Nuclear Medicine, Peking University Cancer Hospital. The inclusion criteria were as follows: (1) available pretreatment PET/CT images and (2) a definite pathological diagnosis of ADC or SCC. A total of 1419 cases were enrolled in the study, with 867 in the ADC cohort and 552 in the SCC cohort. The patients were first split into a training set and a testing set in an 8:2 ratio such that the positive-to-negative sample ratios in both sets were approximately the same as in the complete dataset. The training set was then used to fit and tune the models via tenfold cross-validation, and the testing set was used to evaluate their predictive and generalization ability. Summary statistics of the training and testing sets are given in Table 1.
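The stratified 8:2 split described above can be reproduced with scikit-learn; the snippet below is a minimal sketch, assuming the feature matrix is stored in a NumPy array `X` and the labels in `labels` (1 = ADC, 0 = SCC) — both hypothetical variable names, not the authors' code.

```python
from sklearn.model_selection import train_test_split

# Stratified 8:2 split so that the ADC/SCC ratio in the training and
# testing sets matches the ratio in the complete dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels,
    test_size=0.2,       # 20% held out for testing
    stratify=labels,     # preserve the class proportions
    random_state=42,     # illustrative seed for reproducibility
)
```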

Table 1 Demographic characteristics of the patients in the training and testing cohorts

PET/CT acquisition

The PET/CT system was an integrated Philips Gemini TF 16 PET/CT scanner (Philips, the Netherlands). Before the examination, each patient fasted for more than 6 h, and the blood glucose level, measured by fingertip sampling, was confirmed to be < 10 mmol/L. 18F-FDG was injected intravenously according to the patient's body weight (3.0~3.7 MBq/kg), after which the patient rested quietly for 60 min before being scanned from the cranial region to the upper thighs. Emission data were acquired over 9 to 10 bed positions per patient, with 1.5 min per bed position. The ordered-subset expectation maximization (OSEM) iterative method was used for PET image reconstruction. The transmission (CT) scan was acquired with a tube voltage of 120 kV, a tube current of 100 mA and a slice thickness of 3 mm. Attenuation correction of the PET images was performed using the CT data.

Radiomic features

The region of interest (ROI) was segmented semiautomatically from the whole PET/CT image using a region-growing method; segmentation was performed by a radiologist with 5 years of working experience using MATLAB 2017a (MathWorks, Natick, MA, USA). A level-3 two-dimensional discrete wavelet transform (2D DWT) was applied to the ROI to obtain eight subband images, and feature extraction was then performed on these subbands. A schematic of the wavelet transform is shown in Supplementary Fig. 1. The acquired imaging features comprised first-order intensity statistics and texture features: first-order statistics (18 features), grey-level co-occurrence matrix (GLCM, 22 features), grey-level run-length matrix (GLRLM, 16 features), grey-level size-zone matrix (GLSZM, 16 features) and grey-level dependence matrix (GLDM, 14 features) features, i.e., 86 features per subband and a total of 688 features per ROI. Feature extraction was performed with PyRadiomics [19], a comprehensive open-source Python platform that enables the processing and extraction of radiomic features from medical image data. The definitions of the extracted features are available in the online documentation of PyRadiomics (http://pyradiomics.readthedocs.io/). An overview of the study workflow is illustrated in Fig. 1.
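As an illustration of wavelet-filtered feature extraction with PyRadiomics, a minimal sketch is given below. The file names, the force2D setting and the enabled feature classes are assumptions for this example and do not reproduce the exact configuration used in the study.

```python
from radiomics import featureextractor

# Illustrative settings; the study's exact configuration may differ.
extractor = featureextractor.RadiomicsFeatureExtractor(force2D=True)

extractor.disableAllImageTypes()
extractor.enableImageTypeByName('Wavelet')   # extract features from wavelet subband images

extractor.disableAllFeatures()
for cls in ['firstorder', 'glcm', 'glrlm', 'glszm', 'gldm']:
    extractor.enableFeatureClassByName(cls)

# 'fusion.nii.gz' and 'roi_mask.nii.gz' are hypothetical file names for the
# PET/CT fusion image and the segmented ROI mask.
features = extractor.execute('fusion.nii.gz', 'roi_mask.nii.gz')
for name, value in features.items():
    if not name.startswith('diagnostics'):
        print(name, value)
```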

Feature selection methods

The main purposes of feature selection methods are to simplify the model, decrease the computational costs, avoid the curse of dimensionality and enhance the generalization ability of the model [20]. We considered ten feature selection methods that are widely used in the literature: Laplacian score (LS), ReliefF (ReF), spectral feature selection (SPEC), 2,1-norm regularization (2,1NR), efficient and robust feature selection (RFS), multi-cluster feature selection (MCFS), chi-square score (CSS), Fisher score based on statistics (FS), t score (TS) and Gini index (GINI). The first three methods are feature selection methods based on similarity that assess feature importance in terms of the ability to preserve data similarity. The next three methods are based on sparse learning and employ regularization terms to reduce the weights of unimportant features in the model. The last four methods are statistical based methods that rely on various statistical measures to assess feature importance [21]. A feature selection repository of Python named “scikit feature”, which was released by Li et al. [21], was implemented. The web page of the repository is available at http://featureselection.asu.edu/.
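The study itself used the scikit-feature repository; purely as an illustration of the statistics-based family, the sketch below ranks features by the chi-square score using scikit-learn (a swapped-in equivalent, not the code actually used). Variable names continue from the split sketch above.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Chi-square scores require non-negative inputs, so features are rescaled to [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_train)
scores, _ = chi2(X_scaled, y_train)

# Rank features by descending score and keep the 50 top-ranked ones,
# mirroring the "top 50 features" setting used in the study.
ranking = np.argsort(scores)[::-1]
top50_idx = ranking[:50]
X_train_sel = X_train[:, top50_idx]
```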

Classification algorithms

Classification, a supervised learning task in which a function is inferred from labelled training data [22], is one of the most widely studied areas of machine learning. This study investigated 10 popular classifiers: AdaBoost (AdaB), bagging (BAG), decision tree (DT), naive Bayes (NB), K-nearest neighbours (KNN), logistic regression (LR), multilayer perceptron (MLP), linear discriminant analysis (LDA), random forest (RF) and support vector machine (SVM). Tenfold cross-validation was performed on the training dataset, in combination with a grid search over manually specified parameter bounds and discretizations, to evaluate the models and identify the optimal hyperparameters. The area under the receiver operating characteristic curve (AUROC), which is suitable for unbalanced classification, was used to score the parameter settings.
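A minimal sketch of this tuning scheme is shown below for one of the classifiers (SVM); the parameter grid is illustrative, not the grid actually searched in the study.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Illustrative bounds and discretization; the study's grids may differ.
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
search = GridSearchCV(
    SVC(probability=True),
    param_grid,
    scoring='roc_auc',   # AUROC handles the ADC/SCC class imbalance
    cv=cv,
)
search.fit(X_train_sel, y_train)
print(search.best_params_, search.best_score_)
```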

In this study, we also considered an end-to-end deep learning approach and compared its classification performance with that of the traditional machine learning methods described above. The selected deep learning model was VGG16, trained with a transfer learning strategy and data augmentation; the pretrained weights were obtained by training on the ImageNet dataset. The parameter configurations of each ML algorithm, as well as the data augmentation and transfer learning details of the DL algorithm, are provided in the Supplementary Materials.
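The sketch below outlines a VGG16 transfer learning pipeline of this kind, assuming the fused PET/CT slices have been resized to 224 × 224 three-channel arrays. The classification head, augmentation settings and optimizer are illustrative assumptions, not the exact configuration reported in the Supplementary Materials.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Convolutional base pretrained on ImageNet (transfer learning); kept frozen here.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),                    # dropout to reduce overfitting
    layers.Dense(1, activation='sigmoid'),  # ADC vs. SCC
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss='binary_crossentropy',
              metrics=[tf.keras.metrics.AUC(name='auroc')])

# Simple data augmentation applied to the training images only.
augmenter = ImageDataGenerator(rotation_range=15,
                               horizontal_flip=True,
                               zoom_range=0.1)
# train_images / train_labels are hypothetical arrays of preprocessed ROI slices.
model.fit(augmenter.flow(train_images, train_labels, batch_size=32), epochs=50)
```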

Statistical analysis

Demographic characteristics of the training and testing datasets are described as the mean ± standard deviation (SD) or as percentages, and differences between the two datasets were tested with the chi-square test and Student's t test. The performance of the radiomics ML classifiers and the VGG16 DL algorithm was compared in terms of the AUROC, accuracy, precision, sensitivity (i.e., recall) and specificity. The 95% confidence interval (95% CI) of the AUROC was calculated based on an exact binomial test [23], and the method of DeLong et al. was used for pairwise comparisons of the ROC curves [24]. Accuracy was defined as the proportion of samples in the testing dataset that were correctly classified. Sensitivity was defined as the proportion of ADC samples correctly identified during testing, and specificity as the proportion of SCC samples correctly identified during testing. Statistical analyses were performed using the "scikit-learn", "SciPy" and "math" packages in the Python programming language. The 95% CIs of the AUROC were obtained with MedCalc statistical software (version 19.0.7, Ostend, Belgium). P values < 0.05 were considered statistically significant.
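The reported metrics can be computed from the testing-set predictions as sketched below; `model` stands for any fitted classifier from the previous step, and the column selection is an assumption carried over from the feature selection sketch.

```python
from sklearn.metrics import roc_auc_score, confusion_matrix

X_test_sel = X_test[:, top50_idx]                 # same 50 selected features as in training
y_prob = model.predict_proba(X_test_sel)[:, 1]    # probability of the ADC (positive) class
y_pred = (y_prob >= 0.5).astype(int)

auroc = roc_auc_score(y_test, y_prob)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
sensitivity = tp / (tp + fn)   # recall: proportion of ADC samples correctly identified
specificity = tn / (tn + fp)   # proportion of SCC samples correctly identified
```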

Fig. 1
figure 1

Workflow of the current study. The area inside the blue curve is the region of interest (ROI) obtained by the region-growing method; all pixels outside the blue curve are assigned values of zero and are not involved in further calculations

Results

Demographic and clinical characteristics of the patients

A total of 1419 patients were included in this study. The average age was 65.20 ± 9.59 years; 473 (33.40%) patients were female, 338 (23.82%) had a history of smoking, and 300 (21.14%) had metastasis. Stratified random sampling was used to set aside 20% of the samples (n = 283) as the testing set for evaluating model performance; the remaining 80% were used for training and were divided into 10 subsets during cross-validation. The demographic and clinical characteristics of the training and testing cohorts are detailed in Table 1.

Performance of the radiomics machine learning algorithms

We extracted 688 quantitative features from each segmented tumour region and then applied the ten feature selection methods to rank the features in the training set. In a pilot analysis, the top 10, 20, 30, …, 100 features obtained by each feature selection method were used in turn to fit the 10 machine learning classifiers; almost all models achieved their highest AUROC when using the top 50 features, so the top 50 features of each selection method were used in the main analysis (a sketch of this sweep is given below). The heatmap in Fig. 2 depicts the AUROC and accuracy on the testing dataset with the optimal hyperparameter configuration; results for the top 30, 40, 60 and 70 features are reported in Supplementary Figs. 2, 3, 4 and 5. The white numerals in the grid correspond to the ten best-performing models. The combination of the LDA classifier and the 2,1NR feature selection method achieved the best classification performance, with an AUROC of 0.863 and an accuracy of 0.794; the second-ranked model was the combination of SVM and 2,1NR, with an AUROC of 0.863 and an accuracy of 0.792. For each classifier, the mean AUROC and accuracy over all 10 feature selection methods were calculated as its representative AUROC and accuracy; similarly, for each feature selection method, the mean AUROC and accuracy over the 10 classifiers were used as its representative values. These representative values are given in Tables 2 and 3 for the classifiers and feature selection methods, respectively; by this measure, the RF classifier and the 2,1NR feature selection method performed best. Figure 3 illustrates the change in the AUROC of the top classifiers (LDA, SVM and RF) on the testing dataset when using different numbers of top-ranked features selected by 2,1NR.
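The pilot sweep over the number of top-ranked features can be reproduced with a loop of the following form; this is an illustrative sketch using LDA and a feature `ranking` produced by one of the selection methods (e.g., from the earlier selection sketch), not the authors' exact code.

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for k in range(10, 101, 10):                      # top 10, 20, ..., 100 features
    X_k = X_train[:, ranking[:k]]
    auc = cross_val_score(LinearDiscriminantAnalysis(), X_k, y_train,
                          scoring='roc_auc', cv=cv).mean()
    print(f'top {k} features: mean CV AUROC = {auc:.3f}')
```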

Fig. 2
figure 2

Heatmap depicting the differentiating power of machine learning algorithms (in rows) with the AUROC (a) and accuracy (b) based on the 50 top-ranked features of each feature selection approach (in columns)

Table 2 The average AUROC and accuracy of machine learning classifiers
Table 3 The average AUROC and accuracy of feature selection methods
Fig. 3
figure 3

AUROC value of the top three combined models vs. number of top-ranked features selected via 2,1NR. The dashed curves represent the AUROC of the models on the testing dataset. The relative importance weights of the first 100 features are shown as a stem plot in the lower half of the figure, referenced to the right Y-axis

The average absolute values of the correlation coefficients (CCAA) among the top 50 features selected by each feature selection method are listed in Table 3; a lower CCAA indicates that less redundant information is present in the selected features. The 50 features selected by 2,1NR showed the lowest inter-feature correlation. The matrix diagrams in Fig. 4 explicitly illustrate the pairwise correlations among the 50 features selected by 2,1NR and by three other feature selection methods (ReF, RFS and GINI).
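As described above, CCAA is the mean of the absolute off-diagonal Pearson correlations among the selected features; a minimal sketch, reusing the hypothetical `X_train_sel` matrix of selected features:

```python
import numpy as np

def ccaa(X_sel):
    """Mean absolute pairwise Pearson correlation among the selected features."""
    corr = np.abs(np.corrcoef(X_sel, rowvar=False))        # |r| matrix, features in columns
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]    # drop the diagonal (self-correlations)
    return off_diag.mean()

print(f'CCAA of the 50 selected features: {ccaa(X_train_sel):.3f}')
```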

Fig. 4
figure 4

Matrix diagrams of the absolute value of the Pearson correlations of the 50 top-ranked features selected by 2,1NR (a), ReF (b), RFS (c) and GINI (d). The numbers next to each matrix diagram indicate the rank of the selected feature; 1 indicates that the corresponding feature is optimal

Performance of the deep learning algorithm

We used a deep convolutional neural network (DCNN), VGG16, to train a classification model on the same training dataset. To improve performance and accelerate convergence, the network was initialized with weights obtained by training the same architecture to classify objects in the ImageNet dataset [25], and data augmentation was applied to the training dataset. The VGG16 deep learning model achieved excellent performance on the testing dataset (AUROC, 0.903; accuracy, 0.841) and outperformed the top-ranked radiomics model, the combination of the LDA classifier and the 2,1NR feature selection method. The training process of the network is illustrated in Fig. 5.

Fig. 5
figure 5

Training curve of the VGG16 classifier. One epoch represents one forward and backward pass of the training dataset through the neural network

Performance comparison of top four models

Table 4 presents a comprehensive performance comparison on the testing dataset of the three optimal combinations of classifier and feature selection method and the VGG16 model. Pairwise comparisons of the ROC curves were conducted using the method of DeLong et al. [24], and no statistically significant differences were observed. The results are shown in Table 4 and Fig. 6.

Table 4 The comprehensive performance of the top four models on the testing dataset
Fig. 6
figure 6

The ROC curves of the top four models selected from the training phase on the testing dataset

Discussion

Cancer management has entered the era of precision medicine, which relies on validated biomarkers to classify patients with respect to their probable disease risk, prognosis and/or response to treatment. Early and accurate subtype diagnosis of lung cancer is therefore particularly important. PET/CT, a well-established hybrid functional imaging technique, enables non-invasive tumour evaluation for grading, staging and measuring treatment response in certain cancers; however, its value in the differential diagnosis of ADC and SCC is limited when radiologists interpret the images in a routine manner.

In this study, we used machine learning and deep learning algorithms to explore the value of PET/CT images, analysed by means of radiomics, for the differential diagnosis of ADC and SCC. The results showed that the LDA (AUROC, 0.863; accuracy, 0.794) and SVM (AUROC, 0.863; accuracy, 0.792) classifiers, both combined with the 2,1NR feature selection method, achieved optimal performance among the tested combinations. Furthermore, the VGG16 DL algorithm (AUROC, 0.903; accuracy, 0.841) outperformed all conventional radiomics machine learning pipelines. To the best of our knowledge, this is the first study to apply a panel of machine learning and deep learning algorithms to PET/CT images for the identification of ADC and SCC.

Radiomics, a young and emerging discipline that bridges the gap between medical imaging and personalized medicine [26, 27], explores the value of medical images for disease diagnosis, grading and prognosis prediction using medical image analysis technology and machine learning. However, because numerous feature selection methods and ML algorithms are available [21, 28], the optimal method for a specific type of medical image or a specific target task remains unclear. To determine which feature selection and machine learning algorithms suit given medical image data, the performance of various combinations in medical image classification has been studied in recent years [29,30,31,32]. For example, Parmar et al. [33] investigated 14 feature selection methods and 12 classification methods for predicting the overall survival of patients with lung cancer, using 440 radiomic features extracted from three-dimensional CT images; the Wilcoxon feature selection approach and its variants showed higher prediction accuracy than the other methods, and the naive Bayes classifier outperformed the other classifiers with the highest AUROC (0.72). Zhang B et al. [30] evaluated 6 feature selection methods and 9 classification methods for the radiomics-based prediction of local failure and distant failure in advanced nasopharyngeal carcinoma; they extracted 970 radiomic features from T2-weighted and contrast-enhanced T1-weighted MRI images of each patient and observed that the combination of the RF classifier and the RF feature selection method yielded the highest prognostic performance.

In the current study, we found that the LDA and SVM classifiers coupled with the 2,1NR feature selection method performed best on our dataset. Unlike most of the other feature selection methods considered here, 2,1NR, RFS and MCFS are embedded methods that incorporate feature selection into a learning algorithm (such as logistic regression); such methods account for correlations between features and can therefore handle redundant features during the selection phase. By contrast, the other seven feature selection methods are filter methods, whose main disadvantage is that they typically evaluate features individually and hence fail to address feature redundancy. Our findings are consistent with those of Qian [31], who found that SVM combined with the least absolute shrinkage and selection operator (LASSO) yielded the highest prediction efficacy in differentiating glioblastoma from solitary brain metastases; both LASSO and 2,1NR are sparsity-based feature selection methods. According to the literature, RFS is more suitable for multi-label tasks [34, 35] and MCFS is designed for unsupervised feature selection [36], which may explain why their performance was inferior to that of 2,1NR in this study. The feature analysis also indicated that although radiomics makes it possible to obtain high-throughput features from images, these features contain abundant redundant information. SVM is a robust, powerful and effective machine learning classifier that has been used predominantly in the field of radiomics [37, 38], and in this study SVM with a linear kernel was superior to the radial basis function and other kernels. LDA also performed well; SVM with a linear kernel and LDA share the same function class, the linear decision boundary, so a possible explanation for their good performance is that our data are approximately linearly separable. Another important finding was that the random forest classifier showed good discrimination performance and was the least sensitive to the choice of feature selection method, in line with previous studies [29, 30, 33]. The random forest algorithm, proposed by Breiman [39], has been extremely successful as a general-purpose classification method and is easily adaptable to various ad hoc learning tasks; however, it has been observed to suffer from overfitting [39], which was also seen in the current study.

Compared with similar published studies, we found that the optimal ML classifier and feature selection method are inconsistent across studies, which may be due to various factors, including differences in image modality, feature extraction algorithm, number of selected features, target task and cohort size. For instance, Zhang Y et al. [32] compared the predictive performance of different combinations of classifiers and feature reduction methods for three clinical outcomes based on the same radiomic features extracted from the same CT image dataset: the best model for recurrence was the RF classifier combined with near-zero-variance (NZV) feature selection; for death, the NB classifier combined with zero-variance (ZV) feature selection; and for recurrence-free survival (RFS), the mixture discriminant analysis (MDA) classifier without any feature selection method.

In addition to the traditional machine learning algorithms, we also assessed a state-of-the-art DL algorithm. DL algorithms, especially convolutional neural networks (CNNs), have become the most popular algorithms in computer vision and are widely used in medical image recognition, object detection, image segmentation and other fields. CNNs are built from convolution and pooling operations and can simultaneously perform feature construction, feature selection and prediction modelling, essentially carrying out an end-to-end analysis from raw input images to prediction; as such, they are very powerful and labour-saving compared with radiomics. The VGG16 CNN used in the current study yielded the best performance among all models. However, as with any other tool, DL algorithms have strengths and limitations. A DL model generally has millions of trainable parameters, meaning that a large amount of data is needed to train an ideal model, yet medical images are often difficult to collect in large quantities. In addition, a complete theory explaining what the hidden layers between the inputs and outputs learn is not yet available; this lack of transparency makes it difficult to detect when model predictions may fail or require troubleshooting [40]. We used dropout layers and data augmentation techniques when training the VGG16 model, which are advantageous for improving performance and avoiding overfitting, especially with small datasets.

We used PET/CT fusion images as the image modality in this study. Compared with other modalities, such as CT and MRI, PET inherently suffers from a low signal-to-noise ratio and limited spatial resolution and has sometimes been considered unsuitable for texture feature research. However, tumours are heterogeneous entities at all scales (macroscopic, physiological, microscopic, genetic) [41]. CT and MRI reflect mainly the anatomical structure of tumours, whereas PET/CT can be used to explore intratumoural heterogeneity in both the anatomical and functional dimensions. Texture features calculated from PET have a particular advantage in reflecting metabolic heterogeneity, a recognized characteristic of malignant tumours presumably linked to basal metabolism, cell necrosis and hypoxia [42]. A prospective study of 54 patients with head and neck cancer demonstrated that some PET texture features could be linked to signalling pathway alterations associated with cell proliferation and apoptosis [43]. In our study, the texture features of PET/CT showed excellent performance in identifying the histological subtypes of NSCLC, roughly comparable to that reported for CT images [44].

Our study has several limitations. First, the sample size was small relative to that required by the machine learning methods we used, especially the deep learning algorithm, which may have prevented the establishment of better predictive models and makes the models prone to instability; this problem is, however, prevalent in research on machine learning for medical images [45]. Second, the images in this study came from a single PET/CT scanner, and a previous study showed that differences in acquisition parameters or reconstruction among PET/CT devices affect the extracted texture feature values [46]; samples from multiple centres should therefore be collected in future work to enhance the robustness and generalization ability of the models. Third, our study used only PET/CT fusion images; better results might be obtained by extracting texture features from PET and CT images separately and building models with the merged feature sets.

In conclusion, machine learning and deep learning algorithms can differentiate the histological subtypes of NSCLC, namely ADC and SCC, based on PET/CT images. This approach may provide a promising non-invasive diagnostic tool for informing treatment decisions and fostering personalized therapy for patients with lung cancer.