Introduction

Preoperative differentiation between muscle-invasive bladder cancer (MIBC) and non-muscle-invasive bladder cancer (NMIBC) is crucial for selecting subsequent treatment. Transurethral resection (TUR) is usually chosen as the initial treatment for superficial tumors, whereas muscle-invasive tumors are treated with radical cystectomy (RC) or with adjuvant chemotherapy [1, 2]. However, 27–51% of NMIBC diagnosed by TUR are upstaged to MIBC at RC [1, 3, 4, 5], indicating the relatively low sensitivity of TUR for discriminating muscle-invasive tumors. Despite advances in endoscopic techniques [2] and the availability of sophisticated prediction tools [3, 6, 7, 8], accurate assessment of the clinical stage of bladder cancer (BC) is still challenging.

Magnetic resonance imaging (MRI) allows for differentiation of the bladder wall layers [9, 10]. Multiparametric MRI, including diffusion-weighted imaging (DWI), has shown promise for assessing depth of invasion in BC [11, 12, 13]. Radiomics converts medical images into mineable high-dimensional data by means of feature engineering and machine learning techniques [14, 15]. Radiomics has been used to facilitate clinical decision-making in glioblastoma, lung cancer, and other solid tumors [16, 17, 18], and has shown its ability for preoperative prediction of tumor grading and lymph node metastasis in BC [19, 20, 21]. Recently, radiomics signatures derived from T2WI and DWI showed potential for the differentiation of muscle invasion in BC [22, 23]. However, the sample sizes in those studies were relatively small, and the result of TUR was neither included in nor compared with the radiomics approach.

Thus, with a larger sample set and the result of TUR, this study aimed to develop and validate a more sensitive radiomics model from DWI for discriminating muscle-invasive bladder cancer.

Materials and methods

This study had institutional review board approval, and informed consent was waived due to its retrospective nature.

Patient population

Consecutive BC patients treated between July 2014 and December 2018 were included according to the following criteria: (1) underwent both TUR and RC at our institute and were confirmed to have high-grade urothelial carcinoma, as almost all muscle-invasive tumors are high grade [2]; (2) delay between TUR and RC of less than 12 weeks, and no neoadjuvant chemotherapy or radiotherapy before RC; (3) MRI available before cystoscopic biopsy or TUR, that is, MRI of an intact tumor. Patients were randomly divided into a training set and a validation set.

TUR followed by pathological examination of the obtained specimen served as both a diagnostic procedure and the initial treatment step. For small papillary tumors (< 1 cm), resection was performed in one piece, including a part of the underlying bladder wall. For tumors > 1 cm in diameter, resection was performed in fractions including the exophytic part of the tumor, the underlying bladder wall with the detrusor muscle, and the edges of the resection area. Cauterisation was avoided as much as possible during TUR to prevent tissue deterioration. The specimen obtained by TUR was examined in close cooperation between urologists and pathologists. The pathology report specified tumor grade, depth of tumor invasion, presence of carcinoma in situ (CIS) or histological variant, and whether the detrusor muscle was present in the specimen. Papillary tumors confined to the mucosa (Ta) or invading the lamina propria (submucosa) (T1) were classified as NMIBC. MIBC was confirmed when tumor invaded the detrusor muscle, including irregular nests, single cell infiltration, or tentacular finger-like projections.

At our institute, indications for RC included clinical MIBC and highest-risk NMIBC. Clinical highest-risk NMIBC was defined as T1HG (high grade) with any one of the following conditions or TaHG with any two: multifocal, large (> 3 cm), recurrent, associated with concurrent CIS, mixed histological variant, and BCG failure.

MR imaging

MRI of the bladder, including DWI, was performed using a 3.0-T MR scanner (Ingenia; Philips Healthcare) with a Torso 32-channel phased array coil and without breath-holding. Parameters of DWI with a single-shot EPI (echo-planar imaging) sequence were as follows: FOV, 260 × 284 × 105 mm; matrix, 132 × 170 × 32 slices; slice thickness/gap, 3/0.3 mm; TR/TE, 8216/67 ms; flip angle, 90°; number of excitations, 2; EPI factor, 71; bandwidth, 16.6 Hz; two b values (b = 0 and 1000 s/mm²); directions of motion-probing gradients, 2; fat suppression, spectral attenuated inversion recovery; and total scan duration, approximately 2 min 50 s. Corresponding ADC maps were then automatically calculated voxel by voxel by solving the following equation:

$$ S\left(\mathrm{b}1000\right)/S\left(\mathrm{b}0\right)=\exp \left(-\mathrm{b}1000\times \mathrm{ADC}\right) $$

where S(b1000) and S(b0) represent the signal intensity of a certain voxel in the presence and absence of diffusion sensitization, respectively.
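Solving the equation above for ADC gives ADC = −ln[S(b1000)/S(b0)]/b1000. The following Python sketch illustrates this voxel-wise calculation under the mono-exponential model. It is not the scanner software's implementation; the function name `adc_from_two_b_values`, the treatment of voxels without usable signal, and the example intensities are assumptions made for illustration.

```python
import math

def adc_from_two_b_values(s_b0, s_b1000, b=1000.0):
    """Return the ADC (mm^2/s) of one voxel from the mono-exponential model.

    s_b0 and s_b1000 are the signal intensities without and with diffusion
    sensitization; b is the b value in s/mm^2.
    """
    if s_b0 <= 0 or s_b1000 <= 0:
        # Assumption: voxels without usable signal are marked as missing.
        return float("nan")
    return -math.log(s_b1000 / s_b0) / b

# Hypothetical intensities: the signal falls to 1/e of its b = 0 value,
# which corresponds to an ADC of 1/b = 0.001 mm^2/s.
adc = adc_from_two_b_values(1000.0, 1000.0 / math.e)
print(f"{adc:.4f}")
```

A lower ADC corresponds to a smaller signal drop between the two b values, that is, to more restricted diffusion.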

Qualitative MRI evaluation

Invasion of the muscular layer was evaluated on DWI together with T2-weighted images independently by two radiologists, according to the criteria described in [9]. Briefly, a high signal intensity tumor with a low signal intensity submucosal stalk or a thickened submucosa on DWI (b = 1000 s/mm²), or an intact low signal intensity muscle layer on T2-weighted images indicated the absence of muscle invasion. For patients with multiple tumors, the one with the highest stage was documented.

Tumor segmentation

One radiologist manually segmented the entire tumor area on DWI (b = 1000 s/mm²) using an open-source software package (ITK-SNAP, version 3.4.0; http://itk-snap.org) to yield a volume of interest (VOI). The VOI was copied to the corresponding ADC map for computer-based analysis. After 3 days, the segmentation was repeated in 40 patients by the same radiologist and by another radiologist to assess intra- and inter-observer repeatability.

Feature extraction

First-order intensity features, high-order texture features, and shape features were extracted within the VOIs using an in-house MATLAB program (R2016a, MathWorks Inc.). The high-order texture features were extracted using several methods, including the gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), gray-level size zone matrix (GLSZM), and neighborhood gray-tone difference matrix (NGTDM) methods. In total, 156 quantitative features were extracted for each tumor. Each feature was normalized to its Z-score.
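Z-score normalization rescales each feature across patients to zero mean and unit standard deviation, z = (x − mean)/SD. The sketch below shows the calculation for one feature in Python. It is illustrative only and is not the in-house program used in this study; the function name `z_score`, the use of the sample standard deviation, and the handling of constant features are assumptions.

```python
import statistics

def z_score(values):
    """Normalize one feature across patients to zero mean and unit SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    if sd == 0:
        # Assumption: a constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Hypothetical values of one feature in three patients.
feature = [2.0, 4.0, 6.0]
normalized = z_score(feature)
print(normalized)
```

Normalization puts features with very different numeric ranges, such as volume and texture indices, on a common scale.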

Feature selection and radiomics model development

Feature selection served as a dimension-reduction tool and as a means of discovering features that may provide deeper insight into the classification task. First, intra- and inter-observer repeatability of each imaging feature was measured by the intraclass correlation coefficient (ICC). Features with an ICC of more than 0.85 were selected to build a random forest classification model (RandomForest model, RF) for discriminating muscle-invasive bladder cancers. The number of trees in the random forest classifier was set to 400. The mean decrease in Gini index (MDGini) was used as the variable importance measure.

For comparison, we used a random forest-based wrapper algorithm, Boruta, to select all-relevant imaging features. It evaluates feature relevance by comparing the importance of the original features with that achieved by artificially added random features. Random forest is run iteratively to measure feature importance, while irrelevant features are discarded progressively. The iterations continue until each feature has been confirmed or rejected at a statistically significant level or a maximum number of runs is reached, generating an all-relevant subset of features. Based on the selected all-relevant features, another random forest model (all-relevant model, AR) was built.

Combination model development

Three combination models were built. First, the result of TUR was combined with the RF model and the AR model, respectively, yielding two combined models. When muscle invasion was confirmed at TUR, the case was classified as muscle-invasive regardless of the result of the radiomics model. If the bladder cancer was identified as non-muscle-invasive at TUR, the final result was determined by the radiomics model. For comparison, a third model combining the results of TUR and qualitative MRI evaluation was built according to the same rules.
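The combination rule described above amounts to a logical OR of the two results. The following Python sketch makes the rule explicit; the function name `combined_prediction` and the example cases are hypothetical and are not taken from the study data.

```python
def combined_prediction(tur_muscle_invasive, model_muscle_invasive):
    """Combine the TUR result with a second classifier (radiomics or MRI).

    A positive TUR result is final; otherwise the second classifier decides.
    """
    if tur_muscle_invasive:
        return True
    return model_muscle_invasive

# Hypothetical cases as (TUR result, model result) pairs.
cases = [(True, False), (False, True), (False, False), (True, True)]
results = [combined_prediction(tur, model) for tur, model in cases]
print(results)
```

Because a case is positive when either component is positive, the sensitivity of a combined model cannot be lower than that of either component, and its specificity cannot be higher than that of either component.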

Statistical analysis

All statistical analyses were performed using R-3.4.4 (https://www.r-project.org). All predictive models were trained on the training data set and tested on the independent validation data set. Discrimination performance was evaluated with the area under the receiver operating characteristic (ROC) curve (AUC), accuracy (ACC), sensitivity (SEN), specificity (SPE), and F1 and F2 scores (the F1 score is the harmonic mean of precision and recall; the F2 score weighs recall higher than precision). In all tests, muscle invasion was regarded as the positive result. DeLong’s test was used for comparing AUC, and McNemar’s test for comparing ACC, SEN, and SPE between two models. Inter-observer repeatability for qualitative MRI evaluation was measured by the Kappa value. The R packages randomForest and Boruta were used for model building and feature selection. All p values were two-sided. A p value < 0.05 was considered significant.
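The threshold-based metrics listed above can all be derived from the four cells of a confusion matrix. The F1 and F2 scores are special cases of the general F-beta score, F_beta = (1 + beta²) × precision × recall / (beta² × precision + recall), with beta = 1 and beta = 2. The Python sketch below is illustrative and is not the R code used in this study; the function name `classification_metrics` and the counts in the example are hypothetical, and the sketch assumes that each denominator is non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return ACC, SEN, SPE, precision, F1 and F2 from confusion-matrix counts.

    Muscle invasion is the positive class. Assumes tp + fn, tn + fp and
    tp + fp are all greater than zero.
    """
    sensitivity = tp / (tp + fn)  # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)    # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)

    def f_beta(beta):
        # beta > 1 weighs recall more heavily than precision.
        b2 = beta * beta
        return (1 + b2) * precision * sensitivity / (b2 * precision + sensitivity)

    return {
        "ACC": accuracy,
        "SEN": sensitivity,
        "SPE": specificity,
        "PPV": precision,
        "F1": f_beta(1.0),
        "F2": f_beta(2.0),
    }

# Hypothetical counts, not taken from Table 3.
metrics = classification_metrics(tp=45, fp=15, tn=35, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical example, recall (0.900) exceeds precision (0.750), so the F2 score is higher than the F1 score.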

Results

Patient population

Two hundred and forty-five patients were included. After excluding 37 patients in whom radiomics features could not be extracted due to the small volume of lesions or the limited visibility of images, 218 (169 males; mean age, 66.1 years [range, 37–93]; 141 muscle-invasive tumors) were left for further analyses. In this patient group, TUR only confirmed 87 muscle-invasive tumors, and 38.3% (54/141) of RC-confirmed muscle-invasive tumors were misdiagnosed as non-muscle-invasive tumors at TUR (Table 1, Fig. 1).

Table 1 Baseline characteristics of the patients
Fig. 1

Bladder MRI demonstrates a mass on the right wall of the bladder in a 79-year-old man with painless hematuria. The detrusor muscle layer seems to be intact on T2WI (a), and a low signal intensity thickened submucosa is observed on DWI (b, b value = 1000 s/mm²), indicating the absence of muscle invasion. High-grade urothelial carcinoma staged T1 with concurrent carcinoma in situ was diagnosed at transurethral resection and stratified as highest-risk non-muscle-invasive bladder cancer. Subsequent radical cystectomy confirmed the presence of muscle invasion.

Patients were randomly divided into training set (131 patients; 104 males; mean age, 65.8 years [range, 38–86]; 86 muscle-invasive tumors) and validation set (87 patients; 65 males; mean age, 66.5 years [range, 37–93]; 55 muscle-invasive tumors) (Fig. 2). No significant difference was observed in age (p = 0.696, Wilcoxon rank sum test), gender (p = 0.519, chi-square test), or muscle invasion (p = 0.824, chi-square test) between the two sets (Table 1).

Fig. 2

Study flowchart

Radiomics and combination model development

Seventy-three features with an ICC of more than 0.85 were retained, including first-order, shape, GLCM, GLRLM, GLSZM, and NGTDM features. After Boruta selection, 21 all-relevant features were obtained (Table 2) (Figs. 3 and 4). Internal validation showed no significant difference in AUC (0.907 vs 0.904, p = 0.673, DeLong’s test), ACC (0.839 vs 0.816, p = 0.480, McNemar’s test), SEN (0.873 vs 0.855, p = 1.000), or SPE (0.781 vs 0.750, p = 1.000) between the RandomForest model and the all-relevant model for discriminating muscle-invasive BC (Table 3) (Fig. 5).

Table 2 A summary of 73 radiomics features with ICC of more than 0.85, and 21 all-relevant features (bold and italic) selected using Boruta
Fig. 3

Radiomics workflow

Fig. 4

Heatmap for normalized feature value distribution of the extracted 73 features (above) and the 21 all-relevant features (below) between superficial and muscle-invasive bladder cancers

Table 3 A summary of the performances on validation set of RandomForest model, all-relevant model, transurethral resection, MRI, and combination models for discriminating muscle-invasive bladder cancer
Fig. 5

ROC curves of the RandomForest and all-relevant models for discriminating muscle-invasive bladder cancer on the validation set

The RandomForest model was significantly more sensitive than TUR (0.873 vs 0.655, p = 0.019, McNemar’s test) for discriminating MIBC. It was also more sensitive than MRI (0.873 vs 0.764, p = 0.181), but this difference did not reach statistical significance. When the RandomForest model was combined with TUR, the sensitivity increased to 0.964, significantly higher than that of TUR (0.655, p < 0.001), MRI (0.764, p = 0.006), and the combination of TUR and MRI (0.836, p = 0.046). Notably, the combination model (RandomForest model and TUR) had the highest accuracy (0.897) and F2 score (0.946) for discriminating MIBC (Table 3).

Discussion

In this study, 38.3% (54/141) of RC-confirmed muscle-invasive tumors were misdiagnosed as non-muscle-invasive tumors at TUR, which is consistent with previous reports [1, 3, 4, 5]. Many reasons account for the poor sensitivity of TUR for discriminating muscle-invasive tumors, such as sampling error due to incompleteness of TUR, delay in the interval from TUR to RC, and poor sensitivity of preoperative staging tools [1, 3]. In addition, qualitative MRI evaluation showed only good inter-observer repeatability (Kappa value = 0.605) and a poor sensitivity comparable to that of TUR (0.764 vs 0.655), although substantial advances in DWI have been reported to make multiparametric MRI a feasible and reasonably accurate technique to optimize the treatment of BC [9, 24].

The discrepancy between the previous study [9] and ours may be explained by the following reasons: (1) in the previous report, the sample size was relatively small and the distribution of superficial and muscle-invasive tumors was uneven, which may have distorted ACC, an imperfect index of classification performance; (2) as the authors mentioned, in cases that had undergone management before MRI, inflammatory changes due to prior treatment or biopsy may have affected the results of MRI evaluation; (3) in the previous report, not all patients underwent RC, and clinical stage cannot be regarded as the reference standard in radiologic-pathologic correlation analyses because of its poor sensitivity; (4) the muscle layer is usually depicted as a thin line with low signal intensity and is difficult to distinguish from surrounding fat tissue on DWI. Muscle invasion can only be definitely excluded when an obvious submucosal stalk or thickened submucosa is present; otherwise, subjective judgment may lead to a substantial misdiagnosis rate and poor inter-observer repeatability.

New post-processing and functional multiparametric MRI have shown promise for assessing depth of invasion in BC [11,12,13]. However, it is challenging to acquire images with satisfactory spatial resolution using diffusion tensor imaging (DTI) or diffusion kurtosis imaging (DKI), and these novel imaging techniques are not routinely performed in clinical practice.

Generally, there are two types of imaging features: semantic features and radiomics features. Semantic features are more familiar to radiologists and are commonly used to describe lesions, for example signal intensity or enhancement characteristics. Radiomics features are mathematically extracted quantitative descriptors, which are generally not part of the radiologists’ lexicon. These features capture microscale information that is embedded within images but not visible to the naked eye [14, 15]. Our radiomics model exhibited favorable discrimination performance in internal validation, with an AUC of 0.907 on the validation set. The obvious advantage of TUR is its specificity of 100%, as muscle invasion is considered confirmed once it is observed in the TUR specimen, regardless of the pathological result at RC. For detecting highly malignant muscle-invasive BC, however, what physicians need most is a more sensitive staging tool with a false negative rate as low as possible together with a relatively high positive predictive value (PPV). Recall (sensitivity) is more important than precision (PPV). Considering that the F1 score is the harmonic mean of precision and recall, and that the F2 score weighs recall higher than precision by placing more emphasis on false negatives, our radiomics model and combination model showed improved performance for discriminating muscle-invasive BC compared with TUR and qualitative MRI evaluation, as shown in Table 3.

Another major finding of this study was that a small subset of all-relevant radiomics features selected by Boruta performed equivalently to all the extracted features, whereas in a previous report [19] the classification performance using the selected optimal feature subset outperformed that using the candidate feature set. Feature selection is an important and necessary step, as it makes the model simpler and easier to interpret. When an enormous amount of (“high-dimensional”) data is acquired, there is an exponentially increasing risk of sparsity and loss of efficacy of traditional clustering algorithms. Feature selection addresses this issue and enhances generalization by reducing overfitting. The central premise when using a feature selection technique is that the data contain some features that are either redundant or irrelevant and can thus be removed without incurring much loss of information [25]. Our finding suggests that radiomics data contain redundant or irrelevant features and that feature selection should be performed when building radiomics models.

Our study had several limitations. First, for cases with multiple tumors, we only documented the one with the highest stage for radiologic-pathologic correlation analyses. Although each tumor was analyzed separately in a previous report [9], our method was closer to clinical practice. Second, incorrect manual segmentation, due either to the small volume of the lesions or to the limited visibility of the images, may lead to poor repeatability of feature extraction, so some ineligible cases were excluded. Third, external validation of the radiomics model was not performed. In the future, multicenter validation with a larger sample size is needed to obtain high-level evidence.

In conclusion, a radiomics model from DWI was more sensitive and accurate than TUR and could help discriminate muscle-invasive bladder cancer in clinical practice. Multicenter, prospective studies are needed to confirm our results.