Introduction

Renal cell carcinoma (RCC) is the most common kidney malignancy and accounts for almost 90% of renal cancers [1,2,3,4]. Clear cell renal cell carcinoma (ccRCC) makes up about 70% of RCCs and has a poorer prognosis than other subtypes [1,2,3,4]. Nuclear grade is an established independent histological prognostic factor and is important for the clinical management of ccRCC [5]. Numerous systems have been applied to the pathological nuclear grading of ccRCC, of which the Fuhrman nuclear grading system is the most widely used [6]. It defines 4 nuclear grades (1–4) based on nuclear size, irregularity, and nucleolar prominence [6]. Grades 1 and 2 indicate low-grade tumors with a better prognosis, while grades 3 and 4 indicate high-grade tumors with a poor prognosis [6, 7].

Preoperative prediction of nuclear grade is crucial for personalized treatments like partial nephrectomy, radical nephrectomy, or active surveillance [8,9,10,11]. Percutaneous renal tumor biopsy is an invasive procedure for preoperative determination of nuclear grade [12]. However, it is insensitive in 2.5–22% of the cases [5, 13] and may cause complications. Thus, novel noninvasive methods are needed to overcome this drawback.

Radiomics modeling is a promising tool for guiding clinical decisions through quantitative evaluation of medical images [14,15,16]. Various studies have sought to quantify radiomics features for stratification of low and high Fuhrman nuclear grades [17, 18]. However, this is hampered by a lack of standardized RCC lesion segmentation [19]. Various tumor delineation strategies have been proposed, such as adhering to the visible lesion edge, shrinking the margin by a certain amount to account for partial volume effect or volume averaging, or dilating the margin to include peritumor perirenal fat [20,21,22]. The impact of different definitions of the tumor volume of interest (VOI) on nuclear grading is still unclear. Here, we comprehensively investigated the potential influence of different tumor VOI delineation methods on radiomics-based models that discriminate low- from high-grade ccRCC using dynamic contrast-enhanced CT (CECT).

Materials and methods

Patients

The study was approved by our local institutional review board, and the requirement for informed patient consent was waived due to the study’s retrospective nature. Data were collected through an electronic search of the picture archiving and communication system (PACS) from January 2011 to January 2019. Inclusion criteria were (1) pathologically proven ccRCC with defined Fuhrman grade and (2) preoperative examination with four-phase CECT scans. Exclusion criteria were (1) purely cystic ccRCC, (2) ccRCC without Fuhrman grade, and (3) prominent CT artifacts. The study workflow is shown in Fig. 1.

Fig. 1 Study workflow. UP, unenhanced phase; CMP, corticomedullary phase; NP, nephrographic phase; EP, excretory phase

Fuhrman grade and image acquisition

To ensure reproducibility of pathological diagnosis and reduce intra/inter-observer variability, the traditional 4-tier Fuhrman grading system was re-categorized into a simplified Fuhrman grading system with low grade (grades 1 and 2) and high grade (grades 3 and 4). Fuhrman grading was done by a specialized genitourinary pathologist (W. S. Ding) with 9 years of experience.

Preoperative CECT images were acquired on Toshiba Aquilion One, Siemens Somatom Definition, GE HiSpeed 16, and Philips Brilliance 64. Acquisition parameters were as follows: 120–140-kV tube voltage, automated tube current modulation, and varied milliampere-second settings. All patients were injected with nonionic intravenous contrast material (1 mL/kg body weight, maximum volume = 150 mL) through the antecubital vein using mechanical power injectors. All patients underwent preoperative four-phase CT scans—phase 1: unenhanced (UP); phase 2: postcontrast corticomedullary phase (CMP); phase 3: postcontrast nephrographic phase (NP); and phase 4: postcontrast excretory phase (EP).

Segmentation

All retrieved CT images were stored in anonymized DICOM format. ITK-SNAP software (http://www.itksnap.org) was used to delineate the target 3D VOI on the CT slices containing the tumor in phases 1–4. First, a contour-focused lesion VOI was manually delineated on the NP and then applied to the other 3 phases, with slight adjustments to tailor the VOI to each phase. Next, a larger VOI containing perirenal fat and peritumoral renal parenchyma was generated by dilating the original contour-focused VOI by ~ 1 cm. This process was not entirely isotropic because the dilation stopped when it encountered the bowel, liver, spleen, adrenal gland, vasculature, lymph nodes, or adjacent visceral or muscular tissue. Subtraction of the 2 VOIs yielded a loop VOI (VOI_loop), which was automatically divided into perirenal fat (VOI_loop_fat) and peritumoral renal parenchyma (VOI_loop_margin) using a predefined Hounsfield unit (HU) threshold of − 20. The VOI_loop_fat (< − 20 HU) and the VOI_loop_margin (> − 20 HU) were post-processed by removing isolated parts and filling small cavities. Manual segmentation was done by 2 investigators without prior knowledge of the lesions’ pathology (S.W. Luo and R.L. Wei, with 4 and 5 years of experience in radiological diagnosis, respectively). Agreement between the delineated VOIs was measured using the Dice similarity coefficient. For CT slices with a Dice index > 0.9, the consensus segmentation was the intersection of the two individual segmentations; for slices with a Dice index < 0.9, discrepancies in the lesion boundary were resolved by discussion until consensus was reached.
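The loop-VOI split and the Dice check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses NumPy on a toy 2D slice, the function names `dice` and `split_loop` are hypothetical, and the post-processing step (removing isolated parts, filling small cavities) is omitted.

```python
import numpy as np

HU_THRESHOLD = -20  # HU cut-off stated in the text


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as full agreement
    return 2.0 * np.logical_and(a, b).sum() / total


def split_loop(ct_hu, voi, voi_dilated, threshold=HU_THRESHOLD):
    """Return (loop, fat, margin) masks for the ring between two VOIs."""
    voi = np.asarray(voi, dtype=bool)
    loop = np.asarray(voi_dilated, dtype=bool) & ~voi
    fat = loop & (ct_hu < threshold)     # perirenal fat
    margin = loop & (ct_hu > threshold)  # peritumoral renal parenchyma
    # Voxels exactly at the threshold fall in neither part, mirroring
    # the strict inequalities used in the text.
    return loop, fat, margin


# Toy 5 x 5 slice: fat-like HU on the left, parenchyma-like HU on the right.
ct = np.full((5, 5), -80)
ct[:, 2:] = 40
voi = np.zeros((5, 5), dtype=bool)
voi[2, 2] = True
voi_big = np.zeros((5, 5), dtype=bool)
voi_big[1:4, 1:4] = True

loop, fat, margin = split_loop(ct, voi, voi_big)
print(int(loop.sum()), int(fat.sum()), int(margin.sum()))

# Agreement between two hypothetical readers on the same slice.
reader_b = np.zeros((5, 5), dtype=bool)
reader_b[1:4, 1:3] = True
print(float(dice(voi_big, reader_b)))
```

In the toy slice the ring has 8 pixels, of which 3 fall below the threshold (fat) and 5 above it (margin); the two reader masks overlap with a Dice index of 0.8, which under the rule above would trigger a consensus discussion.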

Different VOI delineation strategies

Based on the original contour-focused VOI and loop VOI, erosion and dilation procedures were done slice-by-slice on the aforementioned VOIs to simulate delineation uncertainties. The VOI was dilated by 2, 4, and 6 mm but still within the scope of VOI + VOI_loop (yielding the VOI_d2, VOI_d4, VOI_d6), VOI + VOI_loop_fat (yielding the Fat_d2, Fat_d4, Fat_d6), or VOI + VOI_loop_margin (yielding the Margin_d2, Margin_d4, Margin_d6). Tumor VOI was eroded by 2 and 4 mm to yield VOI_e2 and VOI_e4. Together with the original contour-focused VOI, 12 types of VOI delineations were obtained for subsequent radiomics analysis.
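A slice-wise erosion and constrained dilation of this kind might look as follows. This is a sketch under assumptions: the 4-connected (cross-shaped) neighbourhood, the helper names, and the pixel spacing used to convert millimetres to pixels are all illustrative choices that the text does not specify.

```python
import numpy as np


def _grow(mask, border):
    """Grow a 2D mask by one pixel using a 4-connected neighbourhood."""
    padded = np.pad(mask, 1, mode="constant", constant_values=border)
    out = padded.copy()
    out[1:, :] |= padded[:-1, :]
    out[:-1, :] |= padded[1:, :]
    out[:, 1:] |= padded[:, :-1]
    out[:, :-1] |= padded[:, 1:]
    return out[1:-1, 1:-1]


def dilate(mask, n_pixels):
    """Binary dilation of one slice by n_pixels."""
    mask = np.asarray(mask, dtype=bool)
    for _ in range(n_pixels):
        mask = _grow(mask, False)
    return mask


def erode(mask, n_pixels):
    """Binary erosion: grow the background, then invert."""
    mask = np.asarray(mask, dtype=bool)
    for _ in range(n_pixels):
        mask = ~_grow(~mask, True)  # outside the slice counts as background
    return mask


def constrained_dilate(voi, allowed, n_pixels):
    """Dilate the VOI but keep only pixels inside VOI + allowed region."""
    voi = np.asarray(voi, dtype=bool)
    return dilate(voi, n_pixels) & (voi | np.asarray(allowed, dtype=bool))


def mm_to_pixels(mm, spacing_mm):
    """Convert a margin in mm to whole pixels (spacing is an assumption)."""
    return max(1, int(round(mm / spacing_mm)))


# Toy slice: a 5 x 5 tumour with parenchyma allowed only on the right side.
voi = np.zeros((9, 9), dtype=bool)
voi[2:7, 2:7] = True
loop_margin = np.zeros((9, 9), dtype=bool)
loop_margin[:, 7:] = True

n = mm_to_pixels(2, spacing_mm=1.0)  # hypothetical 1 mm pixel spacing
margin_d2 = constrained_dilate(voi, loop_margin, n)
voi_e2 = erode(voi, n)
print(int(voi.sum()), int(margin_d2.sum()), int(voi_e2.sum()))
```

Passing VOI_loop, VOI_loop_fat, or VOI_loop_margin as `allowed` would give the VOI_d, Fat_d, and Margin_d families, respectively; applying the functions to every slice reproduces the slice-by-slice procedure.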

Feature extraction and representation

Radiomics feature extraction was done on each of the 12 VOIs in each phase using Pyradiomics [23]. On each phase, 107 candidate features were extracted, falling into 3 categories: shape, first-order statistics (histogram analysis), and second-order statistics (“texture features”). Features extracted on each of the 4 phases, \({F}_{pha}^{1}\), \({F}_{pha}^{2}\), \({F}_{pha}^{3}\), and \({F}_{pha}^{4}\) (each with 107 features), were termed group 1. Concatenations of the features of any 2 phases were termed group 2, including \({F}_{pha}^{1,2}\), \({F}_{pha}^{1,3}\), \({F}_{pha}^{1,4}\), \({F}_{pha}^{2,3}\), \({F}_{pha}^{2,4}\), and \({F}_{pha}^{3,4}\) (each with 214 features). Concatenations of the features of any 3 phases were termed group 3, including \({F}_{pha}^{1,2,3}\), \({F}_{pha}^{1,2,4}\), \({F}_{pha}^{1,3,4}\), and \({F}_{pha}^{2,3,4}\) (each with 321 features). The concatenated features of all 4 phases were termed group 4, i.e., \({F}_{pha}^{1,2,3,4}\) (with 428 features). Discriminative capabilities were compared using each of the resulting 180 (12 × 15) feature types as input to a specific discrimination model.
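The counts above follow from simple combinatorics, which the following sketch reproduces. The phase numbering, the 107 features per phase, and the VOI names come from the text; the function name `phase_feature_sets` is hypothetical.

```python
from itertools import combinations

PHASES = {1: "UP", 2: "CMP", 3: "NP", 4: "EP"}
N_FEATURES_PER_PHASE = 107
VOIS = [
    "VOI", "VOI_e2", "VOI_e4",
    "VOI_d2", "VOI_d4", "VOI_d6",
    "Fat_d2", "Fat_d4", "Fat_d6",
    "Margin_d2", "Margin_d4", "Margin_d6",
]


def phase_feature_sets():
    """Map every non-empty phase combination to its feature count."""
    sets = {}
    for group in range(1, len(PHASES) + 1):  # groups 1-4
        for combo in combinations(sorted(PHASES), group):
            sets[combo] = group * N_FEATURES_PER_PHASE
    return sets


feature_sets = phase_feature_sets()
for combo, n_features in feature_sets.items():
    label = "+".join(PHASES[p] for p in combo)
    print(combo, label, n_features)

print(len(feature_sets), "phase-based feature types")
print(len(VOIS) * len(feature_sets), "feature types across all VOIs")
```

The four groups contain 4, 6, 4, and 1 combinations, giving 15 phase-based feature types, and 12 VOIs × 15 combinations gives the 180 feature types used downstream.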

Modeling and comparisons

We used 22 feature selection methods and 8 classification algorithms to build a total of 176 (22 × 8) discrimination models. The above 180 types of features were fed into each of the 176 discriminative models, resulting in 31,680 (180 × 176) combinations for comparison. We evaluated each of these models with fivefold cross-validation. In each fold, an optimal subset of features (20, 40, 60, and 80 features for groups 1–4, respectively) was first estimated by a specific feature selection method, and the prescreened features were then fed into a classifier for discrimination modeling. To mitigate class imbalance in the patient cohort, the synthetic minority oversampling technique (SMOTE) [24] was used to oversample the minority high-grade ccRCC group by introducing synthetic feature samples. The discriminative power of the models was quantified using the area under the receiver operating characteristic (ROC) curve (AUC), accuracy (ACC), sensitivity (SEN), and specificity (SPE).
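For reference, the four reported metrics can be computed as in the sketch below. It is an illustration in plain Python, not the evaluation code used in the study: the labels, scores, and the 0.5 decision threshold are hypothetical, high grade is coded as 1, and the AUC uses the pairwise-ranking (Mann–Whitney) formulation, which is equivalent to the area under the ROC curve.

```python
def acc_sen_spe(y_true, y_pred):
    """Accuracy, sensitivity, and specificity (positive class = 1).

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = (tp + tn) / len(pairs)
    sen = tp / (tp + fn)  # fraction of high-grade cases detected
    spe = tn / (tn + fp)  # fraction of low-grade cases correctly ruled out
    return acc, sen, spe


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical fold: 3 high-grade (1) and 4 low-grade (0) cases.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(acc_sen_spe(y_true, y_pred))
print(auc(y_true, scores))
```

Because the cohort is imbalanced, sensitivity and specificity are more informative than accuracy alone. When oversampling is combined with cross-validation, it is usually applied to the training folds only so that synthetic samples do not leak into the test fold; the text does not state how this was handled here.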

Statistical analysis

Continuous variables are reported as mean ± SD. Categorical variables are reported as numbers and proportions. Normality of the data distribution was assessed by the Kolmogorov–Smirnov test. Comparisons between groups were done using the chi-square test for categorical variables, the independent t-test for normally distributed continuous variables, and the Mann–Whitney U test for non-normally distributed continuous variables. Discriminative comparisons between the 15 types of features were done using the independent samples Kruskal–Wallis test with Bonferroni correction to adjust significance level in pairwise comparisons. All statistical analyses were done on SPSS version 20 (IBM). Two-tailed p < 0.05 was considered statistically significant.
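To give a sense of how strict the Bonferroni correction is in this setting, the sketch below computes the adjusted significance level. It assumes that all pairwise comparisons among the 15 feature types are tested, which the text does not state explicitly; the function names are hypothetical.

```python
from math import comb


def bonferroni_alpha(alpha, n_groups):
    """Number of pairwise comparisons and the per-comparison alpha."""
    n_pairs = comb(n_groups, 2)
    return n_pairs, alpha / n_pairs


def bonferroni_adjust(p_value, n_comparisons):
    """Equivalent view: inflate the p value and compare with the original alpha."""
    return min(1.0, p_value * n_comparisons)


n_pairs, alpha_adj = bonferroni_alpha(0.05, 15)
print(n_pairs, alpha_adj)

# A hypothetical raw p value of 0.001 no longer passes after adjustment.
print(bonferroni_adjust(0.001, n_pairs))
```

With 15 feature types there are 105 pairs, so a raw p value must fall below roughly 0.00048 to remain significant at the 0.05 level.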

Results

Demographics

The study cohort comprised 177 ccRCC patients who met the inclusion criteria (Table 1): 124 low-grade cases (17 grade 1 [9.6%] and 107 grade 2 [60.5%]) and 53 high-grade cases (40 grade 3 [22.6%] and 13 grade 4 [7.3%]). The groups did not differ significantly with regard to age, sex, or lesion diameter (p > 0.05). Imaging and histological results from two representative patients are provided in Fig. 2.

Table 1 Demographics and characteristics of the study cohort
Fig. 2 Representative examples of clear cell renal cell carcinoma (ccRCC). a–e Low-grade (Fuhrman nuclear grade 2) ccRCC in a 49-year-old man. a–d Unenhanced phase (UP), corticomedullary phase (CMP), nephrographic phase (NP), and excretory phase (EP) CT images (red arrows point to the tumor in the left kidney). e Histologic photomicrograph (hematoxylin–eosin, H & E stain). f–j High-grade (Fuhrman nuclear grade 3) ccRCC in a 69-year-old man. f–i UP, CMP, NP, and EP CT images (red arrows point to the tumor in the right kidney). j H & E stain

Discriminative capabilities of different feature types

The 180 (12 × 15) feature types were compared by feeding each of them to each of the 176 discrimination models. The running time of the established models ranged between 0.02 and approximately 90 s, with a mean time of ~ 8.9 s. Figure 3a shows the boxplot of the AUC distributions achieved by the 176 discriminative models for the 15 phase-based feature types extracted on the contour-focused VOI. Statistical comparisons revealed that \({F}_{pha}^{1}\) outperformed the other single phases. The phase combinations that included phase 1, i.e., \({F}_{pha}^{1,2}\), \({F}_{pha}^{1,3}\), \({F}_{pha}^{1,4}\), \({F}_{pha}^{1,2,3}\), \({F}_{pha}^{1,3,4}\), \({F}_{pha}^{1,2,4}\), and \({F}_{pha}^{1,2,3,4}\), performed better than those without phase 1, i.e., \({F}_{pha}^{2,3}\), \({F}_{pha}^{2,4}\), \({F}_{pha}^{3,4}\), and \({F}_{pha}^{2,3,4}\) (Table 2). The highest AUC (0.87) was obtained using \({F}_{pha}^{1,4}\) with the model combining “Random Forest” and “CIFE”. Furthermore, \({F}_{pha}^{1,4}\) was the feature type most frequently ranked as the best (36 times), followed by \({F}_{pha}^{1,2,4}\) (27 times) and \({F}_{pha}^{1,2,3,4}\) (27 times). Figure 3b shows the boxplot of the AUC distributions for the 12 VOI delineations based on \({F}_{pha}^{1}\). There were no significant differences between the original contour-focused VOI and VOI_e2, Margin_d2, or Margin_d4 (Table 3). The other VOIs performed significantly worse than the original VOI. Table 3 summarizes the highest performance within each of the 12 VOI types, including the phases from which the features were extracted and the classifier and feature selection method used. The model built with “Random Forest” and “CIFE” based on \({F}_{pha}^{1,4}\) and VOI performed best (ACC 81%, SEN 67%, SPE 87%, AUC 0.87), followed by the model built with “Random Forest” and “MIFS” based on \({F}_{pha}^{1,4}\) and Margin_d2 (ACC 78%, SEN 61%, SPE 86%, AUC 0.87).

Fig. 3 Boxplots of the AUC distributions achieved by the 176 discriminative models, categorized by a the 15 phase-based feature types (features extracted on the contour-focused VOI) or b the 12 VOIs (features extracted from \({F}_{pha}^{1}\)). The boxes run from the 25th to the 75th percentile; the two ends of the whiskers represent the 5th and 95th percentiles of the data; the horizontal line and the square in the box represent the median and mean values, respectively. Diamonds represent outliers. Letters above each box in a indicate statistical significance (Kruskal–Wallis test with Bonferroni correction) between any two feature types. Feature types that share no common letter are significantly different

Table 2 The best performances achieved in 15 phases (features extracted on the contour-focused VOI)
Table 3 Best performances in each of the 12 VOIs (achieved with the associated phases, classifier, and feature selection), and the number of times they ranked as best feature. p < 0.05 was considered statistically significant (Mann–Whitney U test)

Key feature analysis

AUC values obtained by all 176 discriminative models using \({F}_{pha}^{1}\) and \({F}_{pha}^{1,4}\) as feature input were visualized on a heatmap (Fig. 4). The highest AUCs for \({F}_{pha}^{1}\) and \({F}_{pha}^{1,4}\) were 0.83 and 0.87, respectively. For all discriminative models with AUC > 0.80, we counted the number of times each feature in \({F}_{pha}^{1}\) was selected among the top-20 features in the fivefold cross-validation (Fig. 5). The top-10 most frequently selected features in \({F}_{pha}^{1}\) are highlighted in blue in Fig. 5 and summarized in Table 4. Because two features tied for 10th place, the list contains 11 features: 7 shape features and 4 first-order statistics features. No texture features were included. Of these, all 4 first-order statistical features (2 with p < \(10^{-3}\) and 2 with p < \(10^{-5}\)) and 3 shape features (least axis length with p = 0.0074, surface volume ratio with p = 0.0046, elongation with p = 0.0249) differed significantly between the low- and high-grade groups.
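The frequency count behind this ranking, including the tie at 10th place, can be sketched as follows. The feature names and fold contents are hypothetical placeholders, and the helper names are not from the study.

```python
from collections import Counter


def selection_counts(selected_per_run):
    """Count how often each feature appears across selection runs.

    Each run is the list of features kept in one fold of one model.
    """
    counts = Counter()
    for features in selected_per_run:
        counts.update(set(features))  # count a feature once per run
    return counts


def top_k_with_ties(counts, k):
    """Return the k most frequent features, keeping ties at rank k."""
    ranked = counts.most_common()
    if len(ranked) <= k:
        return [name for name, _ in ranked]
    cutoff = ranked[k - 1][1]
    return [name for name, n in ranked if n >= cutoff]


# Hypothetical selections from four runs.
runs = [
    ["Median", "Elongation", "LeastAxisLength"],
    ["Median", "Elongation"],
    ["Median", "LeastAxisLength"],
    ["Median", "Sphericity"],
]
counts = selection_counts(runs)
print(counts.most_common())
print(top_k_with_ties(counts, 2))
```

Keeping ties at the cutoff rank is why a nominal top-2 list returns three features in this toy example, just as the top-10 list in the text contains 11 features.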

Fig. 4 Heatmap of the AUC values obtained by the 176 discriminative models (a, \({F}_{pha}^{1}\) as feature input; b, \({F}_{pha}^{1,4}\) as feature input) built with different combinations of classifiers and feature selection methods (features extracted on VOI)

Fig. 5 Histogram showing the number of times each feature in \({F}_{pha}^{1}\) was selected among the top-20 features (features extracted on VOI)

Table 4 The top-10 most frequently selected features in \({F}_{pha}^{1}\). Their corresponding mean (± SD) values (or median (IQR)) and p values between low and high Fuhrman grades are shown

We then tested how well a simple threshold separates the 2 groups, taking as threshold the mean of the 2 group means of each feature (i.e., “M” in Table 4). The first-order statistical features, i.e., the median, the 90th percentile, the 10th percentile, and the root mean squared, demonstrated good discriminative capability: about 70% of the high-grade group had feature values above the threshold, and about 65% of the low-grade group had feature values below it.
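The threshold test described above amounts to the following computation. The feature values are hypothetical numbers chosen for illustration, not data from Table 4, and the function names are assumptions.

```python
from statistics import mean


def midpoint_threshold(low_values, high_values):
    """Threshold M: the mean of the two group means."""
    return (mean(low_values) + mean(high_values)) / 2


def separation(low_values, high_values):
    """Fractions of each group falling on the expected side of M."""
    m = midpoint_threshold(low_values, high_values)
    frac_high_above = sum(v > m for v in high_values) / len(high_values)
    frac_low_below = sum(v < m for v in low_values) / len(low_values)
    return m, frac_high_above, frac_low_below


# Hypothetical median-HU values for a handful of tumours.
low_grade = [20, 25, 30, 45]
high_grade = [28, 40, 44, 48]

m, high_above, low_below = separation(low_grade, high_grade)
print(m, high_above, low_below)
```

In this toy example the threshold is 35 and three quarters of each group fall on the expected side. The two fractions play the role of sensitivity and specificity for a single-feature classifier, which makes them directly comparable with the model-level metrics.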

Discussion

In this retrospective study, we explored the influence of different VOI delineation strategies on radiomics modeling for the discrimination of low and high nuclear grade ccRCC with dynamic CECT. Experimental results demonstrated that the \({F}_{pha}^{1,4}\)-based model achieved the best performance compared with other phasic combinations. The discrimination model based on “Random Forest” and “CIFE” yielded satisfactory performance (ACC 81%, SEN 67%, SPE 87%, AUC = 0.87) with radiomics features extracted from \({F}_{pha}^{1,4}\) based on tumor contour-focused VOI. Furthermore, VOI, VOI_e2, Margin_d2, and Margin_d4 exhibited similar performances and could act as references for tumor segmentation (manual or automatic) to minimize the influence of segmentation uncertainties on performance variations in nuclear grade stratifications of ccRCC.

Radiomics analysis is increasingly used to help distinguish high-grade from low-grade ccRCC. However, most studies have analyzed features from a single phase or from an incomplete set of contrast-enhanced phases. For example, Bektas et al. [17] evaluated the performance of 2D portal-phase CT texture features combined with different ML-based classification schemes in discriminating low and high nuclear grade ccRCCs. The best model was created using SVM, with overall ACC, SEN, SPE, and AUC (for detecting high-grade ccRCCs) of 85.1%, 91.3%, 80.6%, and 0.860, respectively. Lin et al. [25] established machine learning models based on single-phase or three-phase (pre-contrast phase, corticomedullary phase, and nephrographic phase) CT images to differentiate low- and high-grade ccRCC. The best diagnostic performance was observed when using all three phases (AUC = 0.87), followed by single-phase NP (AUC = 0.84), CMP (AUC = 0.80), and PCP (AUC = 0.82) images. However, some valuable information might have been missed since the unenhanced phase or the excretory phase was not considered in these studies. In our study, EP (\({F}_{pha}^{4}\)) exhibited good performance that was only slightly inferior to that of UP (\({F}_{pha}^{1}\)) among single phases, and their combination into \({F}_{pha}^{1,4}\) achieved the best performance over all phasic combinations. The findings of Kocak et al. [18] on the role of unenhanced CT in differentiating nuclear grade are consistent with ours. A possible explanation is that the Fuhrman nuclear grade (nuclear size, irregularity, and nucleolar prominence) correlates more with the heterogeneity of the tumor itself than with tumor perfusion or vascularity.

Lesion segmentation is a critical procedure that might have a substantial impact on the performance of radiomics analysis [19]. Kocak et al. [26] used 47 cases to assess the influence of segmentation margin on radiomics performance and found that contour-focused segmentation (AUC = 0.865–0.984) performed better than segmentation with a lesion margin shrinkage of 2 mm (AUC = 0.745–0.887, p < 0.05). The finding was inspiring, but the analysis was conducted only on the CMP. Gill et al. [22] assessed whether juxta-tumoral perinephric fat (JPF) may aid in machine learning–based nuclear grading of RCC. Their CT-based texture analysis showed statistically significant differences in JPF adjacent to high- versus low-grade ccRCC. Here, we carried out a more thorough investigation of the impact of different VOI delineation strategies on pathological nuclear grading, including dilation and erosion of the contour-focused VOI. To better understand the role of the peritumor components in nuclear grading, we divided them into peritumor renal parenchyma and peritumor perirenal fat. Our results showed no significant differences between the original VOI and the VOI eroded by 2 mm, or the VOI dilated by 2 or 4 mm into the renal parenchyma margin. In particular, the VOI eroded by 2 mm performed similarly to the contour-focused VOI, which differs from past findings [26]. The peritumor renal parenchyma did not interfere with the discriminative performance within 4 mm of the original VOI, probably because higher grade ccRCCs tend to invade adjacent renal parenchyma, so the tumor-renal parenchyma interface within a certain range may reflect the biological behavior of tumors. However, relative to the original VOI, VOI delineations that included perirenal fat performed worse, suggesting that peritumor perirenal fat might not provide additional information beyond the tumor itself and may instead weaken a model’s differentiation capability. These findings may serve as references for determining RCC lesion segmentation tolerance for radiomics analysis.

Interestingly, the top-10 features were shape-based features (n = 7) and first-order features (n = 4), and no texture features were among them. Of the 7 shape-based features, the least axis length, the surface volume ratio, and the elongation differed significantly (p < 0.05) between the 2 groups. The low-grade group had lower values of least axis length but higher values of surface volume ratio and elongation relative to the high-grade group, indicating that low-grade tumors are more inclined to be round. Yap et al. [27] compared the relative contributions of shape and texture metrics in differentiating benign from malignant renal masses and found that shape metrics alone have high prediction performance (AUC 0.64–0.68). Our data also reveal a positive role of shape features in ccRCC nuclear grading. The four top first-order features (median, 90th percentile, 10th percentile, and root mean squared) showed significantly higher values in the high-grade group (p < \(10^{-3}\)). This may be attributable to the larger nuclear sizes and more prominent nucleoli in high-grade tumors.

This study is limited by the relatively small sample size and the lack of an external validation set. The small sample size is attributable to our strict inclusion criteria, which required all 4 phases of contrast-enhanced CT to be available for every patient. Radiomics applications in general continue to suffer from poor external validation due to inter-institutional variations in CT protocols and workflow, both of which strongly affect overall generalizability. In our study, we used patient data from four different CT scanners, and this data heterogeneity may, to some extent, compensate for the absence of external validation. Another limitation is the use of the Fuhrman nuclear grading system instead of the latest WHO/ISUP grading system [28,29,30,31,32,33]. This is because the included cases date back to 2011, when the Fuhrman grading system was widely used. Furthermore, manual segmentation is time-consuming, and automatic segmentation will be necessary.

In conclusion, machine learning–based radiomics analysis on the combination of UP and EP outperformed the other phases and phase combinations. There were no statistically significant performance differences between the tumor VOI and the VOI eroded by 2 mm or dilated by 2 or 4 mm within the peritumor renal parenchyma, which may serve as a reference for manual or automatic segmentation tolerance in radiomics-based modeling of ccRCC nuclear grading.