Introduction

In recent years, the number of identified solid renal masses has significantly increased due to the rapid development of advanced imaging techniques [1], especially in the detection of small renal masses (measuring 4 cm or less) [2, 3]. However, among surgically resected renal masses, approximately 20% are reported to be benign [4], leading to an overall increase in health costs and risks to the patients resulting from overtreatment.

In clinical practice, percutaneous biopsy is the gold standard for differentiating renal masses. However, as it is an invasive method, concern about the high risk of hemorrhage or infection remains [3, 5]. Conventional computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound are used as noninvasive preoperative imaging methods due to their safety and availability. Recently, the functional assessments provided by diffusion-weighted imaging (DWI), arterial spin labeling (ASL), and dynamic contrast enhancement (DCE) imaging have also been considered in the analysis of renal masses [6]. However, these techniques have limited accuracy in the characterization of some benign tumors, such as oncocytomas and fat-poor angiomyolipomas (AMLs), due to their similarity with malignant renal masses in the resulting imaging, limiting their ability to reliably differentiate benign from malignant tumors [7]. In addition, the diagnostic performance is significantly influenced by the experience of the radiologist, which is a major challenge for nonspecialized medical centers.

Machine learning (ML) algorithms have been shown to be valuable in the evaluation of the histopathological characteristics of the disease [8, 9]. Recently, a growing number of studies have shown that ML models have promising performance in predicting the grade of renal tumors and the outcome after renal tumor resection and in diagnosing incidental renal lesions [10,11,12,13]. However, these studies are limited either by a small population or by a lack of external testing data and contrast-enhanced CT (CECT) examinations. In addition, previous studies have paid little attention to the influence of the ML model itself on the diagnostic decision-making of the radiologist.

This study aimed to evaluate the performance of a CT-based ML model in discriminating benign renal tumors (including AMLs without visible fat (AMLwvf) and oncocytomas) from common renal cell carcinomas (including clear cell RCCs (ccRCCs), chromophobe RCCs (chrRCCs), and papillary RCCs (pRCCs)) in a large population, together with an analysis of several factors that may affect model performance, and to discuss the role that the model plays in radiologists’ diagnostic decision-making in routine clinical practice.

Materials and methods

This retrospective study was approved by the ethics committee of our institution, and the requirement for informed consent was waived because the data were obtained from preexisting institutional and public databases without additional burdens to the patients.

Patients

The institutional pathology database was queried to identify pathologically confirmed renal masses obtained via biopsy or surgical resection in our hospital between 2008 and 2020. The pathological diagnosis was reconfirmed by a pathologist with 10 years of genitourinary experience. The inclusion criteria were as follows: (1) preoperative CT scans with three-phase imaging and (2) a primary lesion that was pathologically confirmed. The exclusion criteria were as follows: (1) significant lesion rupture with abundant hemorrhage leading to obscured tumor features; (2) incorrect delay times after contrast injection on CT study; (3) lesions completely composed of cystic components; and (4) lesions with visible fat on precontrast-phase (PCP) CT. The patient inclusion and exclusion flowchart is shown in Fig. 1.

Fig. 1 Flow chart of the study population (*7 patients with more than one lesion)

Finally, 798 patients with 680 RCCs (533 ccRCCs, 69 chrRCCs, and 78 pRCCs) and 125 benign renal lesions (83 AMLwvf and 42 oncocytomas) were included in this study. Among these lesions, 418 RCCs and 78 benign lesions from our institution were included as the training dataset for ML model development and internal validation, and 262 RCCs and 47 benign lesions from two independent institutions and a public database (The Cancer Imaging Archive, TCIA) were included as the external dataset for individual testing.

CT examination

CT examinations were performed in our institution using a 16-detector CT (SOMATOM Sensation 16, Siemens Healthineers), a 64-detector CT (Aquilion 64, Canon Medical Systems), or a dual-source CT (SOMATOM Force, Siemens Healthineers) with the same examination protocol before the patient underwent surgical treatment. The CT parameters and scanning protocol were as follows: tube voltage, 120 kVp; effective tube current-exposure time product, 200–350 mAs; matrix, 512 × 512; and slice thickness, 1.0 or 3.0 mm. Three phases were scanned: the PCP, corticomedullary phase (CMP: 30-s delay after contrast injection), and nephrographic phase (NP: 90-s delay after contrast injection). A total volume of 70–100 mL of contrast medium (Iopamidol, BRACCO) was intravenously injected at a rate of 3.0 mL/s, followed by flushing with 20 mL saline.

Image preprocessing

The image preprocessing steps, which included normalization, pixel resampling, and discretization, were performed during feature extraction for all data. Normalization, which reduces inconsistency in intensity scaling between images, was conducted using the z score. Pixel resampling, which covers both upsampling and downsampling, can improve the accuracy and parameter estimation of the model; the voxel size was set to 0.8 × 0.8 × 0.8 mm. Discretization groups continuous intensity values into discrete bins (the binarization of a grayscale image is a typical application); a bin width of 25 was used in our study.
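The three preprocessing steps can be illustrated with a minimal NumPy sketch. It is not the pipeline used in the study: the function names are hypothetical, the resampling helper only computes the target grid size rather than interpolating, and the discretization follows the common fixed-bin-width convention in which the lowest occupied bin is numbered 1. How the z-scored intensities were scaled before binning is not stated in the text.

```python
import numpy as np

def zscore_normalize(image):
    """Shift intensities to zero mean and scale them to unit standard deviation."""
    image = np.asarray(image, dtype=float)
    std = image.std()
    if std == 0:
        return np.zeros_like(image)
    return (image - image.mean()) / std

def resampled_shape(shape, spacing_mm, target_mm=0.8):
    """Grid size that covers the same physical extent with isotropic voxels."""
    return tuple(int(round(n * s / target_mm)) for n, s in zip(shape, spacing_mm))

def discretize(values, bin_width=25.0):
    """Fixed-bin-width discretization; the lowest occupied bin is numbered 1."""
    values = np.asarray(values, dtype=float)
    bins = np.floor(values / bin_width) - np.floor(values.min() / bin_width) + 1
    return bins.astype(int)
```

For example, a 100 × 100 × 40 volume with 0.8 × 0.8 × 3.0 mm voxels would map to a 100 × 100 × 150 grid at 0.8 mm isotropic spacing.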

Tumor segmentation

After retrieving and acquiring the images of all patients from our institutional picture archiving and communication system (PACS), we loaded the images into ITK-SNAP [14] (Version 3.6.0) and then anonymized and stored the images in NIfTI format. Spatial matching and segmentation were then performed on the tumor images. Through subtle spatial adjustment on the three-phase images, a preliminarily defined region of interest (ROI) was carefully delineated on each selected slice to cover the whole tumor, by three radiologists with 5, 7, and 10 years of experience in abdominal radiology. The effective margins of the ROIs were reconfirmed by another two senior radiologists with 13 and 15 years of abdominal radiologic experience.

Feature extraction

We chose the PCP, CMP, and NP images for feature analysis, and the features within the corresponding ROIs were extracted automatically using the Python (version 3.6.5) package “PyRadiomics” [15]. The extracted texture features included the following: (i) first-order features; (ii) shape features; (iii) gray-level co-occurrence matrix (GLCM) features; (iv) gray-level size-zone matrix (GLSZM) features; (v) gray-level run-length matrix (GLRLM) features; (vi) neighborhood gray-tone difference matrix (NGTDM) features; and (vii) gray-level dependence matrix (GLDM) features [16,17,18,19]. The definitions and mathematical formulas for these features have been described previously [20]. Derived images were obtained by applying different filters (e.g., wavelet, Laplacian of Gaussian, square, square root, logarithm, and exponential filters). All features apart from the shape features, which are independent descriptors extracted from the label mask, were calculated on both the original and derived images [15]. For the three-dimensional tumor segmentations, we also extracted the texture features mentioned above.
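As an illustration only, a PyRadiomics extraction with the settings named in the text could be configured roughly as follows. The resampling, normalization, and bin-width values come from the preprocessing description; the Laplacian of Gaussian sigma values, the function name, and the file paths passed to it are assumptions, since the text does not report them.

```python
# Settings taken from the text: z-score normalization, isotropic 0.8 mm
# resampling, and a fixed bin width of 25.
SETTINGS = {
    "normalize": True,
    "resampledPixelSpacing": [0.8, 0.8, 0.8],
    "binWidth": 25,
}

# Filters named in the text. The LoG sigma values are an assumption;
# the text does not report them.
IMAGE_TYPES = {
    "Original": {},
    "Wavelet": {},
    "LoG": {"sigma": [1.0, 3.0, 5.0]},
    "Square": {},
    "SquareRoot": {},
    "Logarithm": {},
    "Exponential": {},
}

def extract_features(image_path, mask_path):
    """Run PyRadiomics on one image/mask pair and drop diagnostic entries."""
    from radiomics import featureextractor  # requires `pip install pyradiomics`

    extractor = featureextractor.RadiomicsFeatureExtractor(**SETTINGS)
    extractor.enableImageTypes(**IMAGE_TYPES)
    extractor.enableAllFeatures()
    result = extractor.execute(image_path, mask_path)
    return {k: v for k, v in result.items() if not k.startswith("diagnostics_")}
```

Calling `extract_features` once per phase (PCP, CMP, NP) with the NIfTI image and its ROI mask would give one feature dictionary per phase, which could then be concatenated for an all-phase model.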

Features were extracted from the two sets of ROIs confirmed by the two independent senior radiologists. Only features with good to excellent inter-reader reliability (intraclass correlation coefficient, ICC ≥ 0.75) were included in model development.
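The reliability filter can be sketched as below. The text does not state which form of the ICC was used, so this example assumes ICC(2,1) (two-way random effects, absolute agreement, single rater); the function names and the array layout (one row per lesion, one column per feature) are likewise hypothetical.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` has one row per lesion and one column per reader.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-lesion mean square
    msc = ss_cols / (k - 1)              # between-reader mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def reliable_features(reader_a, reader_b, names, threshold=0.75):
    """Keep the features whose ICC between two readers reaches the threshold."""
    a = np.asarray(reader_a, dtype=float)
    b = np.asarray(reader_b, dtype=float)
    return [name for j, name in enumerate(names)
            if icc_2_1(np.column_stack([a[:, j], b[:, j]])) >= threshold]
```

A feature measured identically on both sets of ROIs has an ICC of 1 and is retained, whereas a feature whose values disagree between readers falls below 0.75 and is dropped.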

ML model

CatBoost [21, 22], a gradient boosting library based on decision trees, was used for feature selection and predictive model establishment based on the single-phase and all-phase images. Fivefold cross-validation was performed to estimate the mean and standard deviation of each performance indicator. Texture analysis and ML were performed using Python (version 3.6.5, www.python.org). For 2D model development, only the largest tumor slice was used. In contrast, every renal tumor slice was included for 3D model development. The tuning parameters were as follows: the learning rate was set to 0.05, and the loss function was Logloss. Given the class imbalance of the data, the class weights were set to 5:1.
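A rough sketch of this setup is given below. The learning rate, loss function, and 5:1 class weighting are taken from the text; the label encoding (0 = RCC, 1 = benign), the assignment of the larger weight to the benign class, the fold-splitting helper, and all function names are assumptions made for illustration.

```python
import numpy as np

# Settings stated in the text. Which class receives the weight of 5 is an
# assumption (here the minority benign class, encoded as label 1).
CATBOOST_PARAMS = {
    "learning_rate": 0.05,
    "loss_function": "Logloss",
    "class_weights": [1.0, 5.0],
}

def make_model():
    """Build a CatBoost classifier with the settings above."""
    from catboost import CatBoostClassifier  # requires `pip install catboost`
    return CatBoostClassifier(**CATBOOST_PARAMS)

def stratified_folds(labels, n_folds=5, seed=0):
    """Return test-index arrays with each class spread evenly over the folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for i, chunk in enumerate(np.array_split(idx, n_folds)):
            folds[i].extend(chunk.tolist())
    return [np.array(sorted(f)) for f in folds]

def summarize(scores):
    """Mean and sample standard deviation of a metric across folds."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)
```

With the training set described above (418 RCCs and 78 benign lesions), each of the five folds would hold 15 or 16 benign lesions, so every fold contains enough minority cases to evaluate.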

To explore the influence of the segmentation strategy on the ability of the ML model to discriminate malignant from benign tumors, the ROI was expanded or shrunk 1 mm or 3 mm based on the delineated tumor contour. In addition, we tested the performance of the ML model with tumors smaller than 3 cm from the external database as an independent validation group to confirm the practical value of the ML model in identifying small tumors. Figure 2 shows the process of ML model development and validation in our study.

Fig. 2 Process flow diagram for developing and validating the CT-based ML model
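The expansion and shrinkage of the ROI described above can be approximated with simple morphological operations on the binary mask. The sketch below is one possible implementation under stated assumptions, not the method used in the study: it assumes the mask has already been resampled to 0.8 mm isotropic voxels, converts the margin in millimetres to a whole number of voxel steps by rounding, and grows or shrinks the mask one face-connected voxel layer at a time.

```python
import numpy as np

def _grow_once(mask, pad_value):
    """Add every voxel that shares a face with a True voxel (one step)."""
    padded = np.pad(mask, 1, constant_values=pad_value)
    out = padded.copy()
    for axis in range(mask.ndim):
        out |= np.roll(padded, 1, axis=axis) | np.roll(padded, -1, axis=axis)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    return out[core]

def resize_roi(mask, margin_mm, spacing_mm=0.8):
    """Expand (margin_mm > 0) or shrink (margin_mm < 0) a binary ROI."""
    mask = np.asarray(mask, dtype=bool)
    steps = int(round(abs(margin_mm) / spacing_mm))
    for _ in range(steps):
        if margin_mm > 0:
            mask = _grow_once(mask, False)
        else:
            # Shrinking the ROI is the same as growing its background.
            mask = ~_grow_once(~mask, True)
    return mask
```

Under these assumptions a 1 mm margin corresponds to one voxel layer and a 3 mm margin to four. Repeated face-connected steps only approximate a true Euclidean margin, which would require a distance transform.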

Subjective radiologist assessment

The radiological analysis was performed by three radiologists in our institution who have interpreted images from more than 2000 urologic cases, all of whom were blinded to the histopathologic data and clinical information.

Using a 10-point scoring system, the readers assessed the likelihood of malignancy for each renal lesion according to the following previously described items: shape (regular = 0, irregular = 1), lesion heterogeneity (homogeneous = 0, heterogeneous = 1), internal septa (absent = 0, present = 1), extrarenal extension (absent = 0, ≤ 50% extension of renal contour = 1, ≥ 50% extension of renal contour = 2), internal calcification (absent = 0, present = 1), internal hemorrhage (absent = 0, present = 1), internal necrosis (absent = 0, present = 1), internal arteries (absent = 0, present = 1), and pseudocapsule (absent = 0, present = 1). The pseudocapsule appeared as a hypointense ring around the lesion contour, consisting of a fibrous structure formed by compression of the surrounding renal parenchyma by the expanding renal mass. Higher scores (≥ 6) indicated a greater possibility of malignancy, and all images were interpreted and scored individually [23,24,25].
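The scoring rule amounts to summing eight binary items and one three-level item and comparing the total with a cutoff of 6. A small sketch follows; the item keys and function names are hypothetical, and the extrarenal extension grade is passed in directly as 0, 1, or 2 because the text does not say how a lesion with exactly 50% extension should be graded.

```python
# The eight items scored 0 (absent/regular/homogeneous) or 1 in the text.
BINARY_ITEMS = (
    "irregular_shape", "heterogeneous", "internal_septa", "calcification",
    "hemorrhage", "necrosis", "internal_arteries", "pseudocapsule",
)

def malignancy_score(findings, extrarenal_extension=0):
    """Total score (0-10) from the binary findings plus the extension grade."""
    if extrarenal_extension not in (0, 1, 2):
        raise ValueError("extrarenal_extension must be 0, 1, or 2")
    binary_points = sum(1 for item in BINARY_ITEMS if findings.get(item, False))
    return binary_points + extrarenal_extension

def suggests_malignancy(score, cutoff=6):
    """Scores at or above the cutoff indicate a greater possibility of malignancy."""
    return score >= cutoff
```

For instance, a lesion with five positive binary findings and grade 1 extension scores 6 and would be read as more likely malignant.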

Three months later, the readers reviewed all the images alongside the decision of the ML model for each renal lesion. Each radiologist assessed the agreement between their own interpretation and that of the model and then made a final decision regarding each renal mass.

Statistical analysis

Statistical analysis of the performance of the CT-based model was conducted in Python (version 3.6.5, www.python.org), and scikit-learn was used to compute the performance metrics [26]. The evaluation indicators included the true positive rate (TPR), specificity (SPC), positive predictive value (PPV), negative predictive value (NPV), accuracy (ACC), and area under the receiver operating characteristic (ROC) curve (AUC). The mean age was compared between the patients with RCCs and the patients with benign renal tumors using Student’s t test. Chi-square tests were used to compare the male to female ratio between the patients with RCCs and the patients with benign renal tumors. A p value of less than 0.05 was considered statistically significant.
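For reference, the evaluation indicators listed above can be written out directly from the confusion matrix. The study used scikit-learn; the NumPy sketch below only spells out the definitions, with hypothetical function names and the assumption that label 1 denotes the positive class. Empty classes (which would cause a division by zero) are not handled.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """TPR, SPC, PPV, NPV, and ACC for binary labels (1 = positive class)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return {
        "TPR": tp / (tp + fn),
        "SPC": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }

def auc(y_true, scores):
    """Probability that a positive case outscores a negative one (ties count 0.5)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

The pairwise form of the AUC is equivalent to the area under the ROC curve and makes explicit that the AUC depends only on how the model ranks cases, not on a decision threshold.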

Results

Demographics

Of all 805 renal lesions included in this study, 680 (84.5%) were RCCs, and 125 (15.5%) were benign renal tumors. The median age of patients with RCCs was higher than that of patients with benign renal tumors. The difference in sex distribution between the two groups was not statistically significant. Baseline characteristics for each of the groups are presented in Table 1.

Table 1 Baseline characteristics of patients

Three-phase vs. single-phase image models

As demonstrated in Table 2, the machine learning classifier based on all-phase images achieved a higher AUC than the classifiers constructed from each set of single-phase images. In addition, the ML model based on 3D images was slightly superior to that based on 2D images.

Table 2 Performance of the ML classifier in differentiating benign renal tumors from malignant tumors

In the model development based on 3D images, 107 features were extracted from each of the PCP, CMP, and NP CT images, giving 321 features for the all-phase CT images. However, 30, 29, 37, and 96 features, respectively, were excluded due to low ICC values (< 0.75). Finally, 77, 78, 70, and 225 features from PCP, CMP, NP, and all-phase CT images, respectively, were selected for ML model development. The top 20 ranking features and the ROC curve of the best model are shown in Figs. 3 and 4.

Fig. 3 The top 20 ranking feature scores in the differentiation of RCCs from benign renal tumors (a three-phase; b PCP; c CMP; d NP)

Fig. 4 ROC curves of the best machine learning classifier based on various phase images

All renal tumors vs. renal tumors < 3 cm

The machine learning model was externally validated, and the classifier generally performed better when assessing all tumors than when assessing only tumors measuring < 3 cm; however, acceptable performance was achieved in the latter case (3D image–based model: AUC 0.81 vs. 0.79, ACC 0.86 vs. 0.77; 2D image–based model: AUC 0.75 vs. 0.76, ACC 0.86 vs. 0.75). The performance of the ML model in differentiating renal tumors < 3 cm is shown in Table 3.

Table 3 The performance of the ML model in differentiating renal tumors < 3 cm

Contour vs. noncontour focus

To explore the influence of the segmentation margin, the performances of the ML models based on contour focus, expansions of 1 mm and 3 mm, and shrinkages of 1 mm and 3 mm were compared (Table 4). The performance of the model when the tumor margin was shrunk or expanded by 1 mm or 3 mm was not significantly different from that when the margin was unaltered, with AUCs of 0.79 (tumor margin shrunk by 1 mm), 0.77 (tumor margin expanded by 1 mm), 0.77 (tumor margin shrunk by 3 mm), and 0.74 (tumor margin expanded by 3 mm).

Table 4 The performance of the ML model based on various segmentation strategies

Radiological interpretation with and without the machine learning algorithm

As shown in Table 5, the radiologists had relatively poor performance compared with the ML classifier, especially in terms of the NPV. Notably, all three radiologists achieved better performance when referring to the machine learning classifier’s decision, especially in terms of the NPV. The ROC curves of the three readers with and without ML model assistance are shown in Fig. 5.

Table 5 Performance of the three radiologists and their performance with the ML classifier as a reference in differentiating benign renal tumors from malignant tumors
Fig. 5 ROC curves of subjective interpretations of radiologists with and without the ML classifier for renal mass discrimination

Discussion

Our study indicates that machine learning could be a reliable and reproducible method for helping distinguish RCCs from benign renal masses. Compared to previous studies, we considered several potential factors, including the tumor size and delineation of the tumor margins, and investigated the influence of these factors on the performance of the ML classifier. Moreover, the ML model may be an optimal complementary assistant for radiologists in differentiating common RCCs from benign renal tumors and may be especially helpful in the detection of benign renal tumors.

In previous studies, Kunapuli et al [27] and Erdim et al [7] compared different ML algorithms in building a prediction model for renal mass differentiation. Sun et al [12] compared the performance of radiomic-radiologic ML models and expert-level radiologists in classifying solid renal masses and found that their optimal model achieved an AUC ranging from 0.83 to 0.92. Xi et al [28] investigated the diagnostic value of a deep learning (DL) model based on MRI data, and this model achieved higher performance values than four expert radiologists. All these studies indicate that radiomics could be a promising method for differentiating common RCCs from benign renal tumors, which is consistent with the results of our study. However, the performance of these ML or DL approaches is questionable given the loss of information from entire volumetric lesion images or the lack of consideration of certain radiologic characteristics such as tumor size and margin. Therefore, in our study, we trained our ML model using features derived from both 2D and 3D images. We also assessed the changes in the ML model performance during clinical application with regard to tumor size and segmentation strategy.

Comparing the performance of the model based on features from different CT phase images, we found that the model based on whole-tumor features derived from the three-phase CT images had better performance in renal lesion classification. This may be because the all-phase images and the whole tumor pattern provided more information about the lesion to the model.

Regarding the lesion volume, Tanaka et al [29] investigated the performance of a CNN-based deep learning model in classifying small renal tumors (less than 4 cm), and they found that the deep learning model based on CMP CT images performed better, achieving an ACC of 0.88. In our study, the size of renal tumors varied, and tumors less than 3 cm were selected from the external database to test the ML model. The limited information provided by tumors measuring less than 3 cm reduced the renal mass differentiation accuracy of the model, especially for the single-CMP-image model. As expected, fewer characteristic features can be extracted from smaller lesions, making differentiation more difficult. However, we found that the ML model performance was still acceptable. In routine image interpretation, making an accurate diagnosis for incidental renal tumors measuring less than 3 cm is more important and more difficult than it is for larger renal tumors. Our study showed that the ML model is a practical technique for aiding radiologists in clinical practice, especially for identifying smaller renal tumors.

In addition, the segmentation margin remains a challenge in ML model development. Although manual delineation was considered the standard reference, the tumor contour might be ill-defined in some cases and lead to inaccurate delineations. Few studies have discussed the influence of the segmentation margin on model performance. In our study, the performance of the model was not influenced substantially by shrinking/expanding the margin of the tumor by 1 mm or 3 mm. We assumed that the features subsequently lost or added by the slight changes in the tumor margin comprised a small proportion of all features extracted from the whole tumor; in other words, the features that we actually acquired were sufficient for estimating the homogeneity or heterogeneity of the tumor. Thus, in similar studies, we may not have to focus excessively on the subtle variance in ROI delineation.

In our study, the accuracy of the machine learning classifier was slightly higher than that of the three radiologists. After considering the decision of the machine learning model, however, the three radiologists achieved a higher accuracy than the model alone, which shows the usefulness of the model in the identification of renal masses. In addition, the NPV (0.82–0.99) of the ML model complemented the PPV (0.91–0.95) of the radiologists, potentially leading to perfect complementary assistance.

One limitation of our study that should be noted is the imbalanced nature of the dataset (RCCs:benign renal tumors = 680:125). To address the adverse impact of the imbalanced dataset on the performance of the classifiers, a synthetic minority oversampling technique (SMOTE) was adopted to generate samples for the minority group from the joint weighting of the optimal features. As a result, the representation of the minority benign renal tumor group (AMLwvf and oncocytoma) increased, and the balance of the dataset improved, followed by improvement of the performance indicators (an AUC of 0.97 and an ACC of 0.93). However, the statistics generated by SMOTE were less authentic due to the inevitable overfitting of these algorithms. To some extent, the imbalanced data reflect the real incidence of malignant and benign renal tumors. In addition, only one algorithm was used to build the ML model in our study, which is also a potential weakness. Thus, a large-scale, multicenter, and multialgorithm study is necessary to further validate our findings.
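The core idea of SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority neighbours. The code below is a generic illustration with hypothetical names and defaults (k = 5 neighbours, fixed random seed); it does not reproduce the feature weighting described above and requires at least two minority samples.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating between neighbours."""
    x = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    k = min(k, len(x) - 1)
    # Pairwise Euclidean distances; a sample is never its own neighbour.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(x), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return x[base] + gap * (x[pick] - x[base])
```

Because every synthetic point lies between two existing benign cases, the oversampled set contains no information beyond the original minority samples, which is consistent with the caution expressed above about the metrics obtained after SMOTE.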

Conclusion

Our study shows that a machine learning classifier based on texture features derived from whole-tumor three-phase CT images can be a useful and promising technique for differentiating RCCs from benign renal tumors, which also contributes to the identification of small renal tumors. Furthermore, the ML model perfectly complemented the radiologists’ interpretations and could be useful in improving performance, especially in the precise diagnosis of benign renal tumors.