While prostate cancer (PCa) is the most commonly diagnosed non-cutaneous cancer in men in the United States, the large majority of patients die of other causes [1]. Indeed, it is estimated that only 25% of PCa cases would be detected clinically without prostate specific antigen (PSA) screening [2]. Given the high prevalence and relative indolence of most cases of PCa, accurate risk stratification is of paramount importance in order to avoid overtreatment. This need is particularly critical considering the substantial morbidity associated with radical prostatectomy (RP), including high rates of impotence and incontinence in some studies [3]. While the Gleason score (GS) determined at RP is the gold standard for predicting mortality following RP [4], methods are needed to reliably determine the GS prior to RP in order to provide an accurate prognostic assessment and select an appropriate management regimen, for instance between RP and active surveillance. Current alternative risk stratification tools include a combination of laboratory results, clinical findings, and biopsy GS, although these have shown moderate accuracy at best [5–7]. Furthermore, the central role of biopsy in such prognostic schemes necessitates a degree of invasiveness, with associated cost, patient discomfort, and risk of complications.

Diffusion-weighted imaging (DWI) is an MRI technique that has been established to improve tumor detection and localization [8]. Furthermore, an inverse correlation between apparent diffusion coefficient (ADC) values derived from DWI and GS has been well described [9–23]. This association suggests that quantitative ADC values could potentially serve as a prognostic biomarker for PCa treatment selection, and improve upon both the suboptimal accuracy and invasiveness of current risk stratification strategies. However, the extent of the relationship between ADC values and GS has varied between studies, and a greater understanding of this association is warranted in view of the growing interest in both the urologic and radiologic communities in applying ADC values in clinical settings [24]. While past meta-analyses have explored the role of qualitative DWI for tumor detection, no meta-analysis to our knowledge has examined the use of quantitative ADC values for risk stratification. Therefore, the purpose of this study was to determine the performance of quantitative ADC values for separating high-risk from low-risk PCa, as well as to identify factors influencing this performance.

Methods

Literature search

A single radiologist (HS, a fellow in abdominal MRI) performed searches of the MEDLINE and EMBASE databases to identify relevant articles. The “PICO” question to be addressed by the review was as follows [25]: (P) Patients with prostate cancer; (I) Intervention: quantitative ADC measurements obtained from diffusion-weighted MRI; (C) Comparison: none; (O) Outcome/Target Condition: presence of high- or low-risk tumor based on histopathologic findings from either prostate biopsy or radical prostatectomy serving as reference standard. The searches were conducted on December 11th, 2015. The specific search terms are provided in the Appendix in the Electronic Supplementary Material. The study protocol was not registered.

Study selection

The retrieved abstracts were reviewed by the previously noted investigator (HS). Full texts were retrieved for those articles that represented original research and reported diagnostic accuracy of ADC values for separating high-risk from low-risk prostate cancer. To be included in the meta-analysis, the full-text had to then provide sufficient data for constructing a 2 × 2 contingency table. If multiple eligible studies reported overlapping patient cohorts, then the larger study was included. For studies lacking data for construction of the 2 × 2 table but otherwise meeting inclusion criteria, efforts were made to contact the study authors and obtain these data. Articles in non-English languages were translated to English to allow for further assessment.

Data extraction

The included articles were reviewed by two investigators (HS and AR; AR is a fellowship-trained body radiologist with 8 years of experience). The relevant data were independently extracted and differences resolved by consensus. The extracted study data included covariates to be analyzed in meta-analysis, all represented in binary form, as well as other study characteristics.

The selected covariates included the following: year of publication (classified as before or after January 1, 2014), MRI field strength (classified as 3T for all patients vs. other), GS threshold for differentiating low-risk from high-risk prostate cancer in the study’s 2 × 2 data, use of an endorectal coil, use of RP as reference standard in all patients, whether all patients had a known diagnosis of prostate cancer at the time of MRI, number of tumor foci per patient contributing to 2 × 2 data (classified as 1 vs. more than 1 tumor focus per patient), ADC metric selected for the 2 × 2 data (classified as mean ADC vs. other metric, regardless of inclusion of b0 in ADC computation), maximal b value used for the computation of ADC metric within 2 × 2 data (classified as whether or not at least 2000 s/mm²), prostate zone studied (classified as whole gland vs. peripheral zone), number of readers placing regions of interest (classified as 2 readers vs. 1 reader), reader experience (classified as whether at least one reader had at least 5 years of experience), and use of anti-peristaltic agent (classified as used in some or all patients vs. not used in any patients).

Additional extracted study characteristics included the country of publication, number of overall patients with tumors contributing to the 2 × 2 contingency table, total number of tumor foci evaluated in the 2 × 2 table, patient age (as reported in the article), and PSA (as reported in the article). In studies involving more than one reader, the reported interreader variability was also recorded, when available.

For each study, a 2 × 2 contingency table was constructed using the extracted data for the performance of ADC values for discriminating high-risk from low-risk prostate cancer. If multiple diagnostic accuracies were reported for ADC values derived from different b value combinations or for different ADC metrics, then the b value combination or ADC metric which yielded the highest diagnostic performance (defined as the average of sensitivity and specificity) was selected for inclusion in the meta-analysis. Diffusional kurtosis metrics, when provided in addition to conventional ADC metrics, were not selected. If diagnostic accuracy was reported for separating tumors of varying combinations of GS, then the data for separating GS 3 + 3 tumor from higher GS tumors were selected. In studies in which diagnostic accuracies were reported for multiple observers, the raw 2 × 2 data of the observers were averaged. If the provided sensitivities and specificities did not yield whole numbers for the 2 × 2 data, then attempts were made to contact the authors for clarification, and the data were fit to the published sensitivities and specificities when unsuccessful. Additional comments regarding derivation of the 2 × 2 data for included studies are detailed in Table 1. The end result was a single 2 × 2 table per study for the meta-analysis.
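The selection and averaging rules above can be illustrated with a minimal Python sketch. The class and function names and all counts below are hypothetical and are not taken from any included study; the sketch only shows how sensitivity and specificity follow from a 2 × 2 table, how the metric with the highest average of sensitivity and specificity would be chosen, and how averaging the raw tables of two readers can yield non-integer cell counts.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """2 x 2 table for ADC vs. the pathologic reference standard."""
    tp: float  # high-risk tumors classified as high risk by ADC
    fn: float  # high-risk tumors classified as low risk by ADC
    fp: float  # low-risk tumors classified as high risk by ADC
    tn: float  # low-risk tumors classified as low risk by ADC

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def mean_sens_spec(self) -> float:
        # Selection criterion described in the text: average of
        # sensitivity and specificity.
        return (self.sensitivity + self.specificity) / 2


def best_metric(candidates):
    """Return the name of the candidate with the highest criterion."""
    return max(candidates, key=lambda name: candidates[name].mean_sens_spec)


def average_readers(tables):
    """Average the raw cell counts across readers, cell by cell."""
    n = len(tables)
    return TwoByTwo(
        tp=sum(t.tp for t in tables) / n,
        fn=sum(t.fn for t in tables) / n,
        fp=sum(t.fp for t in tables) / n,
        tn=sum(t.tn for t in tables) / n,
    )


# Hypothetical counts for two ADC metrics reported by one study.
candidates = {
    "mean ADC": TwoByTwo(tp=40, fn=10, fp=8, tn=22),
    "minimum ADC": TwoByTwo(tp=42, fn=8, fp=12, tn=18),
}
selected = best_metric(candidates)

# Hypothetical counts for two readers measuring the same lesions.
averaged = average_readers([
    TwoByTwo(tp=40, fn=10, fp=8, tn=22),
    TwoByTwo(tp=42, fn=8, fp=9, tn=21),
])
print(selected, averaged)
```

In this hypothetical case mean ADC would be selected, since its slightly lower sensitivity is outweighed by higher specificity.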

Table 1 2 × 2 data of included studies separating high-risk from low-risk prostate cancer with ADC values as well as comments regarding derivation of the 2 × 2 data during full-text evaluation

Methodological quality assessment

Overall methodological quality and risk of bias of the articles included in the meta-analysis were assessed using the revised tool for the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) [26]. The questions evaluated for each study are detailed in Table 2. A maximum of 6 months was considered as an acceptable time interval between MRI and biopsy, given the very slow growth rate of PCa with a doubling time of 2 years [27, 28].

Table 2 QUADAS-2 questions evaluated for included studies

Given the expectation of heterogeneity among studies of diagnostic accuracy, the inability of conventional tests for publication bias to handle threshold effects between sensitivity and specificity, as well as uncertainty regarding whether publication bias in fact exists for studies of diagnostic accuracy, tests for publication bias, including the Deeks method, are not recommended for meta-analyses of diagnostic tests and were therefore not performed [29, 30].

Statistical analysis

Following computation of the sensitivity and specificity of each individual study for discriminating high-risk from low-risk prostate cancer using an ADC cutoff, the data were pooled and analyzed in sum using the bivariate random-effects model of Reitsma et al. [31]. The sensitivity, specificity, positive likelihood ratio, and negative likelihood ratio across all studies were computed. A summary receiver operating characteristic (sROC) curve was produced and used to calculate the area under the curve (AUC) [32]. A hierarchical summary ROC (HSROC) model was not used given the inability of such a model to allow for separate assessments of study covariates’ impact on test sensitivity and specificity, prior studies suggesting similar results in the sROC when using a bivariate random-effects model vs. HSROC model [33], and initial exploratory assessment of our pooled data prior to covariate analyses showing essentially identical sROC curves between the two models (not published).

The presence of heterogeneity between studies was assessed for sensitivity and specificity through Cochran’s Q test (χ² > df, p < 0.1: heterogeneity is present) [34] and the τ² statistic (quantifies between-study variance when >0) [35, 36]. In the case of a direct correlation between the sensitivity and false-positive rate, a threshold effect to explain interstudy heterogeneity was considered likely, whereby apparent differences in performance are explained by a tradeoff between specificity and sensitivity. Sensitivity and specificity were calculated separately for subgroups defined by the prespecified covariates using a bivariate random-effects model. Metaregression was performed on pairwise combinations of selected study characteristics observed to significantly impact study performance at univariable assessment. Statistical analysis was performed using the MADA (“Meta-analysis of diagnostic accuracy”) package within the R software environment (R Foundation for Statistical Computing; Vienna, Austria; http://www.R-project.org) [37, 38]. A p value of <0.05 was considered statistically significant.
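The threshold-effect check described above can be sketched in Python as follows. The per-study sensitivities and specificities are hypothetical, and the use of a Pearson coefficient is an assumption, since the type of correlation coefficient is not specified here. A direct (positive) correlation between sensitivity and false-positive rate is equivalent to an inverse correlation of the same magnitude between sensitivity and specificity.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical per-study estimates (not data from the included studies).
sensitivity = [0.55, 0.65, 0.75, 0.85, 0.90]
specificity = [0.92, 0.85, 0.80, 0.70, 0.62]
false_positive_rate = [1 - s for s in specificity]

r_fpr = pearson_r(sensitivity, false_positive_rate)
r_spec = pearson_r(sensitivity, specificity)
print(f"sensitivity vs. false-positive rate: r = {r_fpr:.2f}")
print(f"sensitivity vs. specificity:         r = {r_spec:.2f}")
```

In this hypothetical set, studies with higher sensitivity also have higher false-positive rates, which is the pattern expected when studies operate at different ADC cutoffs along a shared ROC curve.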

Results

Study selection

The two database searches resulted in 617 abstracts (Fig. 1). 488 of these were excluded at the abstract level (Fig. 1), leaving 129 articles reviewed at the full-text level (Appendix in Electronic Supplementary Material). An additional 116 articles were subsequently excluded (Fig. 1). The remaining 13 studies were included in the meta-analysis [9–21]. One of the included studies [19] was published in Chinese, with an English-language abstract.

Fig. 1

Flow chart of study inclusions and exclusions for the meta-analysis. *Exclusion refers to a study that only enrolled patients with pathologically proven intermediate or high-risk PCa (36), and to a study that only included patients eligible for a strictly defined active surveillance protocol, thereby excluding patients from study based on a broad range of Gleason scores (37)

Summary of included studies

The 13 included studies are summarized in Tables 3, 4, and 5. Nine studies were published in 2014 or 2015 [13–21]. The MRI field strength was 3T in seven studies [11, 12, 15, 18–21] and 1.5T in the remaining studies. Ten studies classified GS ≥ 7 tumors as high risk, while 3 studies used a higher GS threshold to define high risk: GS ≥ 3 + 4+5 [12], GS ≥ 4+3 [13], and GS ≥ 8 [16]. In 4 studies, an endorectal coil was used [9, 10, 16, 17]. The reference standard was RP for all patients in 7 studies [10, 12–15, 17, 21], biopsy for all patients in 5 studies [9, 11, 16, 19, 20], and a combination of RP and biopsy in 1 study [18]. In 8 studies, all patients had known PCa at the time of MRI [9–12, 16–18, 21]. There was more than one tumor focus per patient contributing to the 2 × 2 data in 9 studies [9–11, 13, 16–19, 21]. In 4 studies, an ADC metric other than mean ADC was selected as providing highest accuracy: normalized ADC in two studies [15, 16], minimum ADC in one study [13], and 10th percentile ADC derived from an ADC histogram in one study [21]. Five studies used a b value greater than or equal to 2000 s/mm² for calculating ADC [11, 13, 18, 20, 21]. Eight studies included lesions from the whole prostate gland [9, 13–18, 21], while the remaining 5 studies only included peripheral zone lesions. Two readers were involved in placing the ROIs in 10 studies [10–12, 14, 16–21], while one reader was involved in ROI placement in the remaining studies. An expert reader was involved in measuring ADC values in five studies [9, 12, 14, 15, 18], while no expert reader was involved in four studies [11, 13, 16, 17]. An antiperistaltic was used prior to prostate MRI in some or all patients in 5 studies [9, 12, 14, 15, 18].

Table 3 Summary of patient cohorts of included studies
Table 4 Summary of study design and interpretation scheme of included studies
Table 5 Summary of characteristics of MRI protocol of included studies

Of the ten studies with two readers, five of these entailed consensus readings. Of the five two-reader studies with independent readings, only one of these reported interobserver variability [15], noting κ = 0.90.

Quality assessment

The overall scores for the QUADAS-2 questions are shown in Fig. 2. Scores were very high, indicating low risk of bias, for questions relating to quality of patient selection [100% (13/13) of studies receiving a score “yes”] and quality of the flow and timing of study design [85% (11/13)–100% (13/13)] aside from a score of 54% (7/13) for the question relating to appropriate time interval between MRI and the reference standard. Scores were very low, indicating high risk for bias, for questions relating to the quality of index test interpretation [8% (1/13)–31% (4/13)]. Low performance in this category was attributed to the readers’ access to pathologic findings from either biopsy or radical prostatectomy at the time of ADC measurement as well as lack of a prespecified ADC threshold. The question relating to the quality of the reference standard received a low to moderate score [54% (7/13)] due to the lack of RP serving as reference standard for all patients in 6 studies.

Fig. 2

Risk for bias of studies included within the meta-analysis, evaluated by QUADAS-2 questions. Bar chart demonstrates distribution of responses for the 13 included studies for each question. Green bar = response of “yes” (good quality), blue bar = unclear, and red bar = response of “no” (low quality)

Assessment of heterogeneity between studies

Between-study heterogeneity was quantified using τ² among the 13 studies, with resulting τ² = 0.222, although Cochran’s Q was not significant (10.81; df = 12, p = 0.545). A threshold effect was identified [moderate correlation between sensitivity and false-positive rate (r = −0.58)], indicating a tradeoff in sensitivity for specificity and accounting for some part of the apparent heterogeneity in test performance among studies.
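As a consistency check on the reported Cochran’s Q result, the p value can be recomputed from Q and its degrees of freedom, since under the null hypothesis of homogeneity Q follows a chi-square distribution. The Python sketch below uses the closed-form upper-tail probability that holds for even degrees of freedom; the function name is ours and this is not the code used in the analysis, which was run in R.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df.

    For even df, the upper tail equals
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


# Values reported in the text: Q = 10.81 with df = 12.
p_value = chi2_upper_tail_even_df(10.81, 12)
print(f"p = {p_value:.3f}")
```

The recomputed value agrees with the reported p = 0.545 to three decimal places, and Q is below its degrees of freedom, consistent with a non-significant test.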

Overall diagnostic accuracy

There were a total of 1107 tumor foci in 705 patients across the 13 included studies. Sensitivity and specificity both varied among studies, ranging from 51.0% (26/51) to 90.8% (58.8/65) and from 61.1% (11/18) to 96.2% (24.7/26), respectively. Figure 3 depicts the forest plots of sensitivity and specificity for these 13 studies. The summarized sensitivity and specificity of ADC values for separating high-risk from low-risk PCa were 76.9% (95% CI 68.6–83.6%) and 77.0% (95% CI 69.9–82.8%), respectively. The positive likelihood ratio was 3.37 (95% CI 2.66–4.25). The negative likelihood ratio was 0.30 (95% CI 0.22–0.39). Figure 4 shows the sROC curve with AUC of 0.67, indicating moderate performance in discrimination of high-risk from low-risk PCa.
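The relationship between the summary estimates and the likelihood ratios can be illustrated with the standard definitions, LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The Python sketch below simply plugs in the pooled point estimates; it is not the model-based computation used in the analysis, which presumably explains why the plug-in LR+ differs slightly from the reported 3.37.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from a sensitivity/specificity pair."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Pooled point estimates reported in the text.
lr_pos, lr_neg = likelihood_ratios(0.769, 0.770)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Both plug-in values fall within the reported 95% confidence intervals (2.66–4.25 and 0.22–0.39).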

Fig. 3

Forest plots of sensitivity and specificity for quantitative ADC values in separating high-risk from low-risk prostate cancer. Size of squares for individual studies represents weight of each study. Horizontal lines indicate 95% CIs of individual studies. Studies are ordered by increasing sensitivity

Fig. 4

Summary receiver operating characteristic (sROC) curve of accuracy of quantitative ADC values in separating high-risk from low-risk prostate cancer. The ellipsoid confidence region depicts the 95% confidence region in the sROC space for the summary point estimate of diagnostic performance

Subgroup analyses

Table 6 details the results of the subgroup analyses. Sensitivities were significantly higher for studies in which a b value of 2000 s/mm² was not included in computation of the ADC metrics (81.8% vs. 67.1%, p = 0.04) and for studies published in 2014 or 2015 (81.3% vs. 63.3%, p = 0.02). Specificities were significantly higher for studies in which ADC values were measured by 1 reader rather than by 2 readers (93.3% vs. 73.2%, p = 0.005). The remaining analyzed covariates showed no statistically significant difference in terms of sensitivity or specificity (p ≥ 0.162). At pairwise metaregression, both using ADC values derived from b values lower than 2000 s/mm² and publishing in 2014 or 2015 remained significant independent predictors of increased sensitivity (p < 0.001).

Table 6 Estimates of overall sensitivities and specificities for multiple subgroup analyses for separating high-risk from low-risk prostate cancer on MRI (95% CIs listed in parentheses)

Discussion

Our meta-analysis demonstrated that quantitative ADC values have moderate sensitivity and specificity in separating high-risk from low-risk PCa. Nonetheless, both sensitivity and specificity were heterogeneous across the 13 included studies. While studies varied greatly in technical factors, a moderate inverse correlation between reported sensitivity and specificity of individual studies indicates that apparent interstudy heterogeneity was at least partially due to variation along a single ROC curve. Differences in potentially important technical considerations including field strength, coil arrangement, and ADC metric (whether conventional mean ADC or an alternative advanced metric) may also contribute some degree of heterogeneity, though none of these covariates were associated with significant differences in sensitivity or specificity. Nonetheless, our meta-analysis was unable to fully account for all of the observed heterogeneity in test performance across studies, such that additional unexplained sources of heterogeneity likely exist.

The GS determined at RP serves as the single best predictor of mortality in patients undergoing RP for organ-confined PCa [4]. However, the majority of PCa patients have a low-risk tumor with indolent behavior and would be overtreated by RP. The aim in such patients is early and accurate risk stratification in order to avoid RP. To date, no risk stratification tool has been able to predict the GS at RP with sufficient accuracy in order to reliably distinguish between those patients who require RP to prolong cancer-specific survival and those patients who can be managed non-invasively. While the GS determined at systematic biopsy has traditionally served as the mainstay for clinical risk assessments, up to 43% of men with low-grade PCa on standard biopsy will be upgraded on RP [7]. More recently, MRI-targeted biopsy has shown increased accuracy in the determination of tumor grade compared to standard 12 core biopsies [39]. In spite of this improvement, targeted biopsy continues to miss some high-risk lesions. In a large prospective study, Siddiqui et al. demonstrated that fusion biopsy outperformed standard 12 core biopsy in detecting high-risk PCa when using RP as the reference standard (AUC of 0.73 compared to 0.59, respectively) [40]. Other studies have reported a similar trend [41, 42]. While a wide variety of approaches combining results of prostate biopsy with other data elements have also been explored for improved risk stratification, including risk categories incorporating PSA and biopsy findings [5], as well as advanced preoperative nomograms [43], these have also had at best moderate accuracy with AUCs ranging from 0.52 to 0.80 [6].

The previously described prognostic strategies for identifying which patients warrant aggressive treatment vs. those that can be managed non-invasively all fundamentally rely on information obtained from prostate biopsy. Furthermore, monitoring of patients on surveillance would require serial biopsies to detect a potential change. Our meta-analysis indicates that quantitative ADC values also achieve moderate accuracy in risk stratification, albeit in a completely non-invasive manner. In these regards, based on the available literature, ADC measurements appear to compare favorably with other current invasive schemes for risk stratification of PCa.

Only two of the evaluated study parameters were found to be significantly associated with study sensitivity. First, sensitivity was significantly higher in more recently published studies, which underscores the potential of continued optimization in enhancing the value of quantitative diffusion metrics in prostate cancer risk assessment. Such optimization may relate to increasing experience within the radiological community in the interpretation of prostate DWI as well as to continued advancements in the acquisition or postprocessing of prostate DWI, for instance involving hardware or software-related factors not captured in our analysis. Furthermore, this observation raises the possibility that the diagnostic performance of DWI for PCa risk assessment may continue to improve with further advances in imaging techniques. Second, sensitivity was significantly lower in studies incorporating a b value of 2000 s/mm² in ADC computations. This finding may at first seem surprising given the prior studies that have reported utility from including very high b value images in clinical assessments of prostate DWI [44, 45]. However, it has also been suggested that despite the value of the high b value images themselves, these should not be included in ADC map computations given non-mono-exponential diffusion behavior occurring at very high b values which is not reflected by standard ADC computation techniques, as well as the impact of low SNR at very high b values on quantitative ADC metrics [46]. Indeed, other studies have failed to confirm benefit of very high b values for tumor detection when interpretation is based primarily on assessment of the ADC map rather than of the acquired diffusion-weighted images [47–49]. Thus, while there is growing support for very high b value images, based in part on studies supporting their role for tumor detection, it is possible that the very high b value images do not have similar added value for non-invasive risk assessment of a detected lesion.

The only study parameter that was observed to be significantly associated with specificity was the number of readers involved in ADC measurement. Specifically, ADC measurement by one reader rather than two was associated with significantly increased specificity. The exact explanation for this association is unclear. There was no observed significant association between reader experience and test performance to suggest the presence of a “super-reader” in single-reader studies to account for the finding. Alternatively, we speculate that there may be a tradeoff in sensitivity and specificity when involving multiple readers in selecting areas for region-of-interest measurement, as the presence of 1 reader was also associated with a non-significant decrease in sensitivity. Indeed, the relatively small number of tumor foci in studies with a single reader, rather than 2 readers, performing ADC measurements (144 vs. 687 tumor foci, respectively) may have contributed to the lack of statistical significance for the comparison of sensitivity between these two groups of studies. Further evaluation of the influence of multiple readers is difficult given the common presence of consensus readings as well as the reporting of interobserver variability in only a single one of the included studies.

While the observed AUC was reassuring, our quality assessment demonstrated several key limitations in the summarized literature that may affect the generalizability of the findings when seeking to apply quantitative ADC values prospectively. Specifically, a number of weaknesses in design were common among the included studies: the availability of pathologic information at the time of ADC measurement, the lack of a prespecified ADC threshold, and (in a smaller number of studies) the lack of radical prostatectomy as reference standard. In addition, some studies that used prostatectomy as reference standard performed radiologic-pathologic correlation on a per-patient rather than a per-lesion basis [12, 14, 15]. As a result, it is possible that the observed accuracy of ADC measurements in the included studies may overestimate the actual test performance of ADC. Thus, prior to prospective application of ADC measurements in clinical management, additional investigations are required to validate the role of ADC values using measurements blinded to pathologic information derived from radical prostatectomy in combination with testing of an a priori determined threshold. Aside from these several important concerns, other key aspects of study quality were overall strong among the included studies.

There are several limitations in our meta-analysis. First, our sample size of 13 studies was small. A lack of availability of 2 × 2 data in many studies examining the role of ADC in PCa risk stratification contributed to this sample size. For example, while a number of studies have explored associations between ADC metrics and prostate cancer aggressiveness, potentially with high-quality study designs including radical prostatectomy as the reference standard, the data of such studies may have largely comprised correlation coefficients or summary AUC values without actual sensitivity and specificity data, thereby precluding formal inclusion in our meta-analysis (Appendix in Electronic Supplementary Material). It is anticipated that additional original data will continue to accumulate as further investigations are performed. Also, we pooled studies utilizing different ADC metrics, field strengths, and coil designs. While our meta-analysis showed no significant effect of such technical considerations on accuracy, the study may have been underpowered to detect significant differences in some of the subgroups. In addition, our meta-analysis relies on GS, whether obtained by biopsy or RP, as the reference standard. However, substantial literature demonstrates that determination of the Gleason score itself is variable among pathologists [50–52]. This inherent heterogeneity of the reference standard is a further potential source of variation in performance of ADC metrics not considered by the explored radiological literature. Finally, the retrospective nature of the studies may not reflect the performance of ADC measurements in routine clinical practice.

In conclusion, the available literature supports that quantitative ADC values have moderate accuracy in separating high-risk from low-risk PCa, comparing favorably to that of other currently accepted invasive predictors. Sensitivity was higher in more recent studies and in studies not using a b value of 2000 s/mm², while specificity was higher in studies with a single reader. While heterogeneity in diagnostic performance of ADC across studies could not be fully explained by our meta-analysis, such heterogeneity appeared to represent, at least in part, a tradeoff between sensitivity and specificity. Although the improvement in sensitivity in more recent publications is encouraging regarding the potential future role of quantitative diffusion metrics in prostate cancer risk assessment, methodological heterogeneity exists in numerous technical considerations across studies. Moreover, the available studies may have overestimated the true test performance of ADC due to several identified common methodological biases, including the lack of radical prostatectomy as a reference standard in a number of studies. Thus, additional prospective investigations evaluating the performance of ADC values obtained using a standardized methodology that addresses these particular weaknesses are needed prior to applying ADC values to guide clinical decision-making for PCa.