Introduction

Incidental findings of cystic renal masses (CRMs) are common on abdominal computed tomography (CT) images [1]. Majority of these lesions are benign, but some may be malignant. Bosniak classification [2] stratifies these lesions in five classes: Bosniak I and II are almost always benign, while Bosniak IIF, III, and IV can be potentially malignant with 10–20% [3], 50% [4], and 90% [5] malignancy rate, respectively.

While “Bosniak Classification version 2019” [2] aims to be more objective and reduce the inter-reader variability in characterization of CRMs, it still remains subjective. A further reduction in inter-reader variability can be obtained with the use of quantitative features extracted from the imaging data, referred to as radiomic features. Radiomics coupled with machine learning algorithm has been an emerging research area aiming to complement radiologists’ visual interpretation, enhance the overall diagnostic and prognostic accuracy, and help reducing the inter-reader variability [6,7,8].

In a previous study, a texture-based machine learning algorithm was created to potentially reduce inter-reader variability in distinguishing benign from potentially malignant CRMs on CT images [9]. A short list of 1st-order radiomic features (mean, standard deviation, entropy, skewness, and kurtosis) was employed, given their strengths in straightforward implementation and known low variability among different software packages [10]. 1st-order features are based on the histogram of the pixel intensity values within a region of interest (ROI) and are independent of the spatial distribution, unlike the higher-order features varying with spatial distribution of the pixel values [11].

In this work, as an extension to the previous study, we investigate the effect of inter-reader variability in delineating the ROI, which is considered one of the most important issues in radiomics and thus has been extensively studied for various disease cases and imaging modalities [12,13,14,15]. In addition, we investigate whether the inclusion of higher-order features can further improve the performance of the machine learning algorithm in classifying benign from potentially malignant CRMs, compared to the use of 1st-order features alone. For this purpose, we evaluate the performance of a set of robust features, determined based on the intraclass correlation coefficient (ICC), which is a widely used metric to assess the feature reliability among the readers [16].

Materials and methods

Patient cohort and demographics

In this IRB-approved and HIPAA-compliant retrospective study, we included 230 cystic renal masses (161 benign and 69 potentially malignant) with diameters larger than 1 cm derived from our prior study on the use of CT texture analysis for stratification of cystic renal masses into benign and potentially malignant lesions per Bosniak classification v2019 [9]. All exams included 3 mm sections reconstructed with a 50% overlap before and 100 s (nephrographic phase) after IV administration of 50–150 ml iodinated contrast material (300–370 mg iodine/ml). Two radiologists independently and manually delineated 2D ROIs (referred to as R1 and R2) of the CRMs at an axial CT slice at the same level used in the prior study. The Bosniak class of all lesions was previously established independently by three blinded radiologists [9].

Synthetic ROI

To increase the number of observers, we generated two synthetic ROIs (Syn1 and Syn2) for each of the original ROI (R1 or R2) by modifying them with the following steps:

Syn1: Random fluctuation + 4-pixel dilation + 3-pixel erosion

Syn2: Random fluctuation + 4-pixel dilation + 5-pixel erosion

Here, the first step introduces random variation about the original contour in the vertical direction. The noisy contour is then smoothed and filled up by applying a 4-pixel dilation. Finally, the application of 3-pixel and 5-pixel erosion operations creates on average slightly larger (Syn1) and smaller (Syn2) synthetic ROIs compared to the original ROI, respectively. Figure 1 illustrates these steps applied to a sample ROI. As a result, a total of 4 synthetic ROIs were obtained in addition to the two original ROIs. The synthetic ROIs derived from R1 are named as R1_Syn1 and R1_Syn2. Similarly, R2_Syn1 and R2_Syn2 were derived from R2.

Fig. 1
figure 1

Image operations for generating the synthetic ROIs. The white contour is the original ROI drawn over the cystic renal mass. The green and blue contours overlaid on these images show the preparation steps of the new syn1 (Top) and syn2 (Bottom) ROIs, respectively. First, the original ROI is altered with random fluctuations resulting in noisy contours (Left). Next, 4-pixel dilation operation reduces the noise (Middle). Last, the erosion operations (Right) bring the size of the ROIs closer to the original ROI, where 3-pixel and 5-pixel erosion operations produce slightly larger and smaller regions, respectively

To measure similarity and to assess how well the synthetic ROIs represent the degree of inter-reader variability, we used dice coefficients (Dcoeff), similar to the approach presented in [15]. The Dcoeff for a pair of ROIs is calculated as the ratio of the overlap area (Ao) to the total area (At) multiplied by two (i.e., Dcoeff = 2 \(\times\) Ao/At). We first obtained the Dcoeff between R1 and R2 to establish a baseline for the expected degree of inter-reader variability. Then, we compared the Dcoeffs of the synthetic and original ROIs to this baseline.

Radiomic features

We used an open source Matlab code by Vallières et al. [17] to extract a total of 169 features. 160 of these were higher-order features from gray-level co-occurrence matrix (GLCM), gray-level run-length matrix (GLRLM), gray-level size zone matrix (GLSZM), and neighborhood gray-tone difference matrix (NGTDM) for two different quantization levels (8 and 16 Gy-levels) and for two different quantization algorithms (equal and uniform). The remaining 9 were 1st-order features (size of the major axis, total number of pixels, solidity, eccentricity, mean, standard deviation, entropy, skewness, and kurtosis).

Robust features intraclass correlation (ICC)

We computed the ICC (single fixed raters) among the 6 observers (2 original and 4 synthetic) using the Pingouin python package. We considered the features that provided ICC > 0.85 as robust features, which was defined as “good reliability” in [16].

Building the machine learning model

For the classification of the cystic renal masses, we used a random forest decision tree-based machine learning algorithm (sklearn.ensemble.RandomForestClassifier), which is among the most commonly used classifier in radiomics field because of its favorable features such as computational efficiency, stability against data perturbation [18], and superior performance [19, 20]. We applied a tenfold stratified cross-validation (CV) to train and test the performance of the model. In this method, all the operations were strictly separated between the training and test datasets. First, the data were split in 10 randomly selected groups (folds). Then, ninefold of the data were selected to build the machine learning model and the remaining onefold was used to test the performance of the model. This process was repeated ten times so that all individual folds are used for the test. The steps in each CV iteration were as following: (1) selection of the robust features based on the ICC > 0.85 criterion; (2) data normalization by removing the mean and scaling to unit variance; (3) removal of highly correlated features (Pearson Coefficient > 0.85); (4) fivefold hyperparameter grid-search to tune the model training parameters including size and depth of the decision trees; (5) fit the model using tuned parameters from step 4 on all ninefold training data; and (6) test the performance of the model on the onefold test data to compute the quantitative metrics: area under curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The test results from these 10 iterations were then aggregated to obtain the mean ± standard deviation (SD). We also repeated these steps (except for the steps 1 and 3) using the nine 1st-order features only.

Results

Table 1 summarizes the dice coefficient percentages obtained between different pair of observers. The Dcoeff% for the original ROIs, R1 and R2, was 88.1 ± 6.7. The Dcoeff% among synthetic and original ROIs were close to this value indicating that the synthetic ROIs provided sufficient variation in contours mimicking the degree of variation seen among readers. Figure 2 presents six sample ROIs visually confirming realistic variations among original and synthetic ROIs. The dice coefficients between R1 and R2, Dcoeff(R1,R2), are presented in increasing order from left (61%) to right (95%) at the top row. The middle row shows the inter-reader variability between R1_Syn1 and R1_Syn2 and the bottom row shows the inter-reader variability between R2_Syn1 and R2_Syn2.

Table 1 Dice coefficient percentages among the original and synthetic ROIs
Fig. 2
figure 2

The original ROIs (R1 and R2) delineated by the two radiologists (Top) and the synthetic ROIs derived from R1 (Middle) and R2 (Bottom) are shown, respectively, for six sample CT studies. The corresponding dice coefficients are listed on the top left corner of each subfigure

Robust features

We used a total of 169 features in this study (36 GLCM + 52 GLSZM + 52 GLRM + 20 NGTDM + 9 first order) as listed in five separate feature groups in Fig. 3a with their ICC values (3 GLZSM and 2 GLRM features with negative ICC values were not listed). 78 of these were considered as robust features with ICC > 0.85 (16 GLCM + 26 GLSZM + 23 GLRM + 7 NGTDM + 6 first order). Note, for the purpose of this figure only, the ICC values are calculated from the entire dataset without using the CV folds. Hence, the ICC values in each CV fold can be slightly different than the ones shown in this figure. The number of features was further reduced when highly correlated features were dropped based on Pearson correlation threshold of 0.85. For example, for R2, the remaining total features were 22 (2 GLCM + 7 GLSZM + 8 GLRM + 0 NGTDM + 5 first order) as listed in Fig. 3b. In Fig. 4, the ICC distributions are summarized in box plots, where the median of the 1st-order features have higher ICC values. Figure 5 shows the feature value heatmaps for the robust and uncorrelated features obtained from the 2 original and 4 synthetic ROIs. R1 and R1_Syn1 had the lowest (17) and highest (25) number of features, respectively (R2 had 22 features as mentioned above). The feature values for the benign and potentially malignant cases are well separated in these heatmaps indicating good classification performance and reliability across the six sets of ROIs.

Fig. 3
figure 3

ICC distributions of the 164 features (3 GLZSM and 2 GLRM features with negative ICC values were not listed) are shown in the left panel (a) where the features with ICC > 0.85 (ICC = 0.85 threshold is indicated by the red dashed lines) are considered as robust features in this study. Note, for the purpose of this figure only, ICC values were computed from the entire dataset without using the CV folds. Gray scale levels (8 and 16) and quantization algorithms (equal and uniform) are denoted in feature names. Robust (ICC > 0.85) and uncorrelated features (Pearson correlation < 0.85 based on R2) are listed in the right panel, where all NGTDM features were eliminated (b)

Fig. 4
figure 4

ICC distribution of all the feature groups. Mean (green triangle), median (orange line), and outliers (open circles) are shown

Fig. 5
figure 5

Heat maps showing the robust (ICC > 0.85) and uncorrelated (Pearson < 0.85) clustered feature values for the malignant (potentially) and benign cases for the original (R1 and R2) and synthetic ROIs. Because of the clustering, the feature names are not listed in the same order among different heatmaps. The horizontal axis showing the 230 ROIs was kept in the same order for all heatmaps. Also, the number of features varied among the ROI-groups (range 17–25 features) because of the differences in correlations within each group. The two cases are in general well separated for all six sets of ROIs

Performance of the machine learning algorithm

The performance metrics for the robust (ICC > 0.85) and uncorrelated (Pearson < 0.85) features (ranging from 17 to 25 features total) as well as for the 1st-order features (9 features total) averaged over 6 observers (2 original and 4 synthetic ROIs) obtained with the random forest classifier are summarized in Table 2 in terms of mean ± SD. Figure 6 shows the receiver operating characteristic (ROC) curves obtained from the 6 observers, each averaged over 10 CV folds. Figure 7 shows the first 10 most important features from these CV folds.

Table 2 Quantitative performance
Fig. 6
figure 6

Receiver operator characteristic (ROC) curve of the two original and four synthetic ROIs when all the robust and uncorrelated features (Left) and when 1st-order features alone were used. The figure legends show the area under the curve AUC (mean ± SD) values averaged over 10 CV folds for each observer

Fig. 7
figure 7

The feature importance score for the ten most important features obtained from the total of 60 CV folds. Based on this metric, Entropy and Mean are the most important two features for the classification of cystic renal masses

Discussion

We presented the use of a random forest machine learning algorithm for the classification of benign from potentially malignant CRMs seen on CT images. For this purpose, we studied the inter-reader variability in delineation of ROIs and determined stable (robust) 1st- and higher-order radiomic features under this variability. We then evaluated the classification performance of these robust features in comparison to the use of 1st-order features obtained from original and synthetic sets of ROIs.

Generating synthetic ROIs from the original ROIs is a common practice in radiomics as manual segmentation of large datasets is labor-intensive and time-consuming. This is typically performed through simple image operations such as smoothing, dilation and erosion [13, 21], and isotropic resizing [22, 23] over the original ROIs, initially created by one or few readers. In a comprehensive study, Zwanenburg et al. [24] investigated 18 combinations of image perturbations to determine feature robustness, based on noise addition, translation, rotation, volume resizing, and contour randomization. Many of these methods, however, may not capture the inter-reader variability in a realistic manner and may be highly correlated with the original ROIs as the overall shape remains very similar under these operations. An extensive study by Joscowicz et al. [25] has well illustrated the actual degree of inter-reader variability based on 11 radiologists’ manual delineation of lesion (brain hematoma) and organ (brain, lung, liver, and kidney) contours in CT. Therefore, in contrast to many prior studies, our synthetic ROIs were generated through a combination of random fluctuations, dilation, and erosion operations over the original ROIs, which realistically captured the inter-reader variability as presented quantitatively in Table 1 and qualitatively in Fig. 2. For comparison, we performed a sub-study by generating three more synthetic ROIs by applying simple 1-pixel, 2-pixel, and 3-pixel dilation operations on R1. The average dice coefficients between “R1 and dilation of R1”, DCoeff(R1, dil_R1), were 93% (for 1-pixel), 86% (for 2-pixel), and 81% (for 3-pixel), respectively. These results indicate quantitatively that 1-pixel and 3-pixel dilations do not capture the level of variability recorded between R1 and R2 (DCoeff(R1,R2) = 88%). While the average DCoeff with the 2-pixel dilation appears to be reasonable quantitatively, it does not represent a realistic variation visually as it follows the same contour (slightly larger in every directions) of the original ROI.

By computing ICC of these datasets (two original and four synthetic ROIs), we were able to determine the robust radiomic features from a larger sample, which improves the reliability of feature selection. We listed these features using the entire dataset in Fig. 3a. This list, however, was not directly used in the training data to prevent information leakage between the training and test data [26]. Instead, we performed ICC computations for robust feature selection and all the other operations strictly within each cross-validation fold.

The robust features extracted from these ROIs were used in the random forest algorithm for the classification of benign and potentially malignant CRMs, which resulted in high AUC (0.87), sensitivity (0.82), and specificity (0.90) averaged over six sets of ROIs as shown in Table 2. The quantitative results were considerably affected by different delineations among the six datasets. For example, AUC ranged for the robust and uncorrelated features from 0.83 ± 0.09 to 0.93 ± 0.04 and for the 1st-order features from 0.84 ± 0.09 to 0.91 ± 0.04.

As the AUC ranges show, the inclusion of higher-order features for this specific study did not improve the performance compared to the usage of the 1st-order features alone. In fact, a review paper by Traverso et al. [27] reported that the 1st-order features were overall more reproducible among the 41 research studies they reviewed. The same publication also stated that entropy was among the most stable 1st-order features, which was also the case in this study. As shown in Fig. 7, the entropy was the most important feature followed by the mean. Another advantage of using 1st-order feature is their simple implementation and low variability among different software packages.

The quantitative results presented here were overall consistent with the results of a previous study applying 1st-order features on a similar dataset [9]. One exception to this was the sensitivity, which was considerably higher in this study (0.82) compared to the previous one (0.67). This difference is likely due to the exclusion of renal masses smaller than 1 cm in size in the present study as they are often too small to fully characterize on CT and typically considered clinically inconsequential [28]. Also, the differences in the 1st-order feature list and steps in the train/test workflow (e.g., grid-search, inclusion of higher-order features, and ICC-based robust feature selection) may have contributed to the differences in results between the two studies. Another important difference between the two studies was the inclusion of a new set of ROIs (R2) created by a second radiologist and four more synthetic ROIs derived from R1 and R2 to investigate the inter-reader variability in delineating the ROI.

While the random forest algorithm used in this study demonstrated high discriminatory capability in distinguishing benign from potentially malignant CRMs, slight improvements in these results can be expected with the inclusion of features that were not considered here and/or with the usage of different machine learning algorithms. However, in a sub-investigation we performed, we were not able to improve the results significantly by a different machine algorithm (e.g., Gradient Boost), inclusion of more correlated features, and/or expanding the grid-search (unpublished data).

Another potential limitation was the usage of 2D ROI drawn on a single axial slice, which may not fully capture the information for the Bosniak classification compared to a volumetric approach using 3D ROI. Most current radiomics research, however, is based on single slice for practical reasons as 3D approach is more time-consuming and complicated. Additionally, there appears to be no strong evidence in literature indicating that the classification based on 3D method is better than that of 2D methods [29,30,31,32]. While typical 2D approach selects the center slice to represent the entire mass, in this work, the slice with the highest Bosniak class features was selected to better capture the characteristics of the cystic renal mass.

Conclusion

We presented the level of inter-reader variability in delineation of CRMs and its effect on the classification performance. Using random forest algorithm, we obtained high sensitivity and specificity in the prediction of the benign and potentially malignant cystic renal masses. We also demonstrated that the usage of the 1st-order features alone is sufficient for the classification of cystic renal masses and the inclusion of higher-order features does not necessarily improve the performance.