Introduction

The incidence and prevalence of prostate cancer (PCa) have both increased significantly over the past two decades, and PCa has surpassed lung cancer as the most common cancer in men. It is generally accepted that these changes resulted from PSA screening. At the same time, PCa-specific mortality has decreased, which is attributed to early detection and treatment. Nevertheless, treatment of low-risk PCa causes unnecessary side effects, which impair the quality of life of patients and their families and increase healthcare expenses.1, 2 Active surveillance (AS) is an increasingly important strategy to avoid overtreatment of patients who harbor clinically insignificant disease while offering curative treatment to those in whom disease is reclassified as higher risk after an observation period and repeat biopsy.3, 4, 5, 6

Up to 35% of patients in AS will experience biopsy reclassification at a median follow-up of 2.9 years.7, 8, 9 Unfortunately, noninvasive methods of cancer surveillance, including PSA doubling time (area under the curve (AUC) 0.59) and PSA velocity (AUC 0.61), are not associated with prostate biopsy reclassification, and annual prostate biopsy is currently recommended by some for monitoring men undergoing AS.9 Nevertheless, serial prostate biopsies are associated with potentially serious infectious and quality-of-life sequelae.10, 11 Improved patient selection and less invasive methods of cancer surveillance are needed to improve the safety of AS in the management of low-risk PCa.

Magnetic resonance imaging (MRI) can provide additional high-resolution information on tissue properties, such as diffusion and enhancement. MRI enables soft tissue contrast and characterization, and advanced computational methods are available to assess function. Thus, MRI alone or combined with clinical parameters may be useful in the prediction of insignificant PCa, particularly in the context of clinically nonpalpable tumors.12, 13, 14 Previous studies have shown that MRI may be a promising imaging technique for identifying men suitable for AS and for monitoring them within the AS cohort. This study is a diagnostic meta-analysis that aims to summarize the diagnostic performance of MRI for PCa reclassification among AS candidates.

Materials and methods

Selection of studies

All authors participated in the selection of eligible studies for inclusion. We searched PubMed for citations published before November 2014 describing MRI used in AS for PCa. The search strategy included the terms prostate neoplasms (MeSH) or PCa, watchful waiting (MeSH) or AS or expectant management, reclassification or histologic progression, and MRI. Article titles and abstracts were reviewed independently for eligibility by two authors, and discrepancies were resolved by consensus. Studies were included if they met all the following criteria: (1) the study population consisted of PCa patients managed with AS; (2) the study assessed the diagnostic performance of MRI for histologic progression; (3) histopathologic analysis was used as the reference standard test; and (4) at least one pair of the absolute numbers of true-positive and false-negative results, or of true-negative and false-positive results, was available or could be derived adequately. For a study to contribute true-positive, false-positive, true-negative and false-negative results to the meta-analysis, all four had to be available. Studies were excluded if they met one of the following criteria: (1) the article was a review or meta-analysis or (2) (potentially) overlapping study populations were reported for the same outcome.

Data extraction

Two reviewers independently screened all titles and abstracts identified in the database search. The full text of every article included by either reviewer on the basis of the abstract was obtained. To determine eligibility for inclusion, one author reviewed all full-text articles. A second author repeated this assessment independently for a random selection of 10% of full-text articles, and there was complete agreement regarding the excluded articles. Studies that did not fulfill all of the inclusion criteria were excluded. Two reviewers then independently extracted data from all of the included studies, and discrepancies were resolved by consensus.

Quality assessment

We used the revised Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) checklist to assess the study quality in terms of the risk of bias and the applicability of included studies.15, 16 Each study was independently assessed by two reviewers. Any discrepancies were resolved by discussion or, if agreement could not be reached, by arbitration by a third reviewer.

Data analysis and synthesis

We used standard methods recommended for meta-analyses of diagnostic test evaluations. The following measures of test accuracy were computed for each study: sensitivity, specificity, positive likelihood ratio (PLR), negative likelihood ratio (NLR), diagnostic odds ratio (DOR), and their 95% confidence interval (CI). The advantage of DOR is its independence from disease prevalence and the approximately normal distribution of the natural logarithm of DOR.17 As a general rule, diagnostic tests with a DOR >25 are considered moderately accurate, and tests with a DOR >100 are considered highly accurate.18, 19
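The per-study measures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MIDAS or Meta-DiSc implementation: the function name `accuracy_measures` and the example counts are hypothetical, no cell of the 2 × 2 table is assumed to be zero, and the confidence interval shown for the DOR uses the usual standard error of ln(DOR).

```python
import math

def accuracy_measures(tp, fp, fn, tn, z=1.96):
    """Accuracy measures for one study from its 2 x 2 table.

    Assumes every cell is > 0 (no continuity correction is applied).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    plr = sensitivity / (1 - specificity)   # positive likelihood ratio
    nlr = (1 - sensitivity) / specificity   # negative likelihood ratio
    dor = (tp * tn) / (fp * fn)             # diagnostic odds ratio
    # ln(DOR) is approximately normal, so the CI is built on the log scale.
    se_ln_dor = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    ln_dor = math.log(dor)
    dor_ci = (math.exp(ln_dor - z * se_ln_dor),
              math.exp(ln_dor + z * se_ln_dor))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "plr": plr,
        "nlr": nlr,
        "dor": dor,
        "dor_ci": dor_ci,
    }

# Hypothetical counts, not taken from any included study.
measures = accuracy_measures(tp=30, fp=20, fn=10, tn=40)
print(measures)
```

Because DOR equals PLR divided by NLR, the three measures can be cross-checked against one another for any study.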

If the 2 × 2 data were not available, attempts were made to derive them from reported summary statistics, such as the sensitivity, specificity and/or likelihood ratios.
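One way such a derivation could work is sketched below, assuming that a study reports its sensitivity, its specificity and the numbers of patients with and without reclassification on the reference standard. The function name and the example values are hypothetical, and rounding to whole patients is an assumption that should be checked against the totals reported in each paper.

```python
def counts_from_summary(sensitivity, specificity,
                        n_reclassified, n_not_reclassified):
    """Rebuild (tp, fp, fn, tn) from reported summary statistics.

    Counts are rounded to the nearest whole patient, so the result is
    only as exact as the reported sensitivity and specificity.
    """
    tp = round(sensitivity * n_reclassified)
    fn = n_reclassified - tp
    tn = round(specificity * n_not_reclassified)
    fp = n_not_reclassified - tn
    return tp, fp, fn, tn

# Hypothetical study: 40 reclassified and 60 not reclassified patients.
counts = counts_from_summary(0.75, 0.80, 40, 60)
print(counts)
```

Recomputing sensitivity and specificity from the rebuilt counts and comparing them with the published values is a simple safeguard against rounding artifacts.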

Preliminary exploratory analyses were conducted for each test by plotting estimates of the sensitivity and specificity from each study on forest plots and in the receiver operating characteristic (ROC) space. The AUC, the corresponding 95% CI and the P-value were calculated.

The sensitivity and specificity for the single test threshold were used to plot a summary receiver operating characteristic (SROC) curve.20

We adopted the following overall approach to evaluate the heterogeneity expected between studies of diagnostic tests. We assessed heterogeneity using forest plots and then tested for statistical significance using the Chi-squared test and the I2 statistic (I2>50% indicates significant heterogeneity).21 We then calculated pooled estimates using bivariate mixed-effects binary regression modeling, which provides more conservative estimates than fixed-effects modeling when heterogeneity is present.22 Spearman’s correlation analysis was used to check for a threshold effect, which can contribute to heterogeneity. In addition, meta-regression was performed to identify possible sources of heterogeneity. Subgroup analyses and sensitivity analysis were also performed if necessary.

Publication bias is a concern for meta-analyses of diagnostic studies. In our study, publication bias was assessed by the regression of the natural logarithm of DOR against the inverse of the square root of the effective sample size; P<0.05 for the slope coefficient is suggestive of significant publication bias.23 The MIDAS module for STATA was used for the meta-analysis of the diagnostic data.24, 25 STATA version 12 (StataCorp, College Station, TX, USA) and Meta-DiSc software (version 1.4; XI Cochrane Colloquium, Barcelona, Spain) were used for data analysis. The significance level was set to 0.05.
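The regression behind this funnel plot asymmetry test can be sketched as follows. The sketch assumes the form in which Deeks' test is commonly described: ln(DOR) is regressed on 1/√ESS, weighted by the effective sample size ESS = 4·n1·n2/(n1 + n2), where n1 and n2 are the numbers of patients with and without reclassification. The function name and the counts are hypothetical, zero cells are assumed absent, and the P-value for the slope (a t-test in the full procedure) is omitted. This is not the MIDAS implementation.

```python
import math

def deeks_regression(studies):
    """Weighted least-squares fit of ln(DOR) on 1/sqrt(ESS).

    `studies` is a list of (tp, fp, fn, tn) tuples with no zero cells.
    Returns (slope, intercept); the slope is the bias coefficient.
    """
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in studies:
        n_pos, n_neg = tp + fn, fp + tn
        ess = 4 * n_pos * n_neg / (n_pos + n_neg)  # effective sample size
        xs.append(1 / math.sqrt(ess))
        ys.append(math.log((tp * tn) / (fp * fn)))
        ws.append(ess)
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical 2 x 2 tables, not the data of the included studies.
example_studies = [(30, 20, 10, 40), (45, 15, 25, 90), (12, 9, 8, 31), (20, 5, 10, 65)]
slope, intercept = deeks_regression(example_studies)
print(slope, intercept)
```

A slope close to zero corresponds to a symmetric funnel plot; with only a handful of studies the test has little power, so a non-significant slope is weak evidence.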

Results

A total of 47 articles from PubMed were initially retrieved. The last update was in November 2014 (Figure 1). After reading the titles and abstracts, 12 articles were found to be suitable for further evaluation. Of the 12 articles reviewed in detail, 5 were excluded (2 were reviews, 2 presented nonextractable data and 1 focused on expanding AS criteria), leaving 7 studies to be included in the final analysis (Table 1).26, 27, 28, 29, 30, 31, 32

Figure 1

Flowchart of study selection.

Table 1 Summary of included studies and diagnostic estimates (individual study data and independent pooled sensitivity/specificity estimates)

Study characteristics and quality assessment

After full-text review, the seven selected articles included 1028 patients. Sample sizes ranged from 50 to 388 patients. All seven included studies reported the mean age, which ranged from 60.2 to 67 years. Three of the seven studies included very low-risk PCa patients satisfying Johns Hopkins AS criteria,28, 31, 32 three included low-risk PCa patients26, 29, 30 and only Fradet et al.27 included patients with low- and intermediate-risk PCa. However, AUC values were reported only by Vargas et al. and Stamatakis et al.26, 28 The values of MRI for disease reclassification and the basic characteristics (true-positive, false-positive, true-negative and false-negative values for MRI) are shown in Table 1, and the summary of included studies (MRI techniques, confirmatory biopsy and definition of disease progression) is shown in Table 2 and Table 3.

Table 2 Summary of included studies (MRI techniques)
Table 3 Summary of included studies (confirmatory biopsy and disease progression)

The quality varied across included studies (summary of QUADAS-2 quality data, Figure 2 and Figure 3); however, all had low to unclear risk of bias and applicability concerns, and therefore we did not exclude any from analysis. None of the included studies had a case-control design. All studies had the same reference standard because histopathologic analysis was an inclusion criterion of this meta-analysis. The MRI interpretation was blinded in five studies, but this was unclear in the other two studies. Histopathologic interpretation was blinded in four studies, whereas this was unclear in three studies. We scored four of seven studies as having an appropriate interval between the MRI and biopsy, and we scored the other three studies as unclear because they did not report an interval at all.

Figure 2

Graph showing the risk of bias and applicability concerns: review of authors’ judgements about each domain, presented as percentages across included studies.

Figure 3

Chart summarizing the risk of bias and applicability concerns: review of authors’ judgements about each domain for each included study. −, high concern; ?, unclear concern; +, low concern.

Diagnostic accuracy

The sensitivity point estimates ranged from 0.19 to 0.93 across individual studies, and the specificity point estimates ranged from 0.40 to 0.97 (Figure 4). The pooled estimates of MRI for disease reclassification diagnosis were as follows: sensitivity, 0.69 (95% CI, 0.44–0.86); and specificity, 0.78 (95% CI, 0.53–0.91). Positive predictive value (PPV) (0.44) was relatively low, whereas negative predictive value (NPV) (0.91) was high, given a pretest probability of 20%. We also noted that the PLR was 3.1 (95% CI, 1.6–6.0), the NLR was 0.40 (95% CI, 0.23–0.70) and the DOR was 8 (95% CI, 4–16).

Figure 4

Forest plot showing the sensitivity and specificity of magnetic resonance imaging in the diagnosis of prostate cancer reclassification among active surveillance candidates.

The SROC curve summarizes the overall diagnostic accuracy, showing the tradeoff between sensitivity and specificity. We found that the AUC was 0.79 (95% CI, 0.76–0.83) (Figure 5).

Figure 5

Summary receiver operating characteristic (SROC) curve for assessment of the diagnostic accuracy of magnetic resonance imaging in the diagnosis of prostate cancer reclassification among active surveillance candidates.

On the basis of the ROC plots, we identified one study as a potential outlier: the study by Hanna et al.29 After exclusion of this potential outlier, the diagnostic accuracy of MRI for disease reclassification among AS candidates showed a sensitivity of 0.74 (95% CI, 0.45–0.91), a specificity of 0.81 (95% CI, 0.56–0.94) and a DOR of 13 (95% CI, 8–20).

Threshold effect and subgroup analysis

As sensitivity and specificity vary with research conditions, giving rise to different threshold effects33 and DORs, it is necessary to assess the presence of a threshold effect. If a threshold effect exists, sensitivity and specificity show a negative correlation, and the points scattered in the ROC plane form a typical ‘shoulder-arm’ pattern. In our study, the SROC curve for MRI did not show this typical pattern (Figure 5). The Spearman correlation coefficient was 0.679 (P=0.094), which indicates that there was no significant heterogeneity from a threshold effect.
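The rank correlation used for the threshold check can be illustrated with a small, dependency-free sketch. It correlates sensitivity with the false-positive rate (1 − specificity); because ranks are unchanged by a logit transformation, working on the raw proportions gives the same coefficient as working on logits. The helper names and the four data points are hypothetical, and the P-value is not computed here.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-study estimates, not the included studies.
sensitivities = [0.50, 0.90, 0.70, 0.80]
false_positive_rates = [0.10, 0.40, 0.30, 0.20]
rho = spearman(sensitivities, false_positive_rates)
print(rho)
```

A strong positive correlation between sensitivity and the false-positive rate is the pattern expected when studies differ mainly in their positivity threshold.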

After testing for heterogeneity caused by other sources, the results showed that sensitivity (P<0.001, I2=91.3%), specificity (P<0.001, I2=96.9%), PLR (Cochrane Q=59.04, P<0.001, I2=83.8%), NLR (Cochrane Q=99.93, P<0.001, I2=94.00%) and DOR (Cochrane Q=22.37, P=0.001, I2=73.2%) in the included studies showed high heterogeneity.
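The link between Cochran's Q and I2 can be made explicit with the standard formula I2 = 100% × (Q − df)/Q, where df is the number of studies minus one. The sketch below applies it to the Q values reported above for the NLR and the DOR, assuming all seven studies contributed (df = 6); the function name is hypothetical.

```python
def i_squared(q, n_studies):
    """I2 (in percent) from Cochran's Q; negative values are set to 0."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100

# Q values for NLR and DOR as reported in the text, with seven studies.
i2_nlr = i_squared(99.93, 7)
i2_dor = i_squared(22.37, 7)
print(round(i2_nlr, 1), round(i2_dor, 1))
```

Under that assumption the formula gives values that round to the reported 94.0% for the NLR and 73.2% for the DOR.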

We considered the quality of the studies, AS protocols, year, country, MRI technique and biopsy strategy as potential sources of heterogeneity. In the meta-regression, the P-value for MRI with endorectal coils (ERCs) was 0.0182, lower than those for the other covariates. This result suggests that the MRI technique might contribute to heterogeneity (Table 4).

Table 4 Meta-regression analysis of diagnostic accuracy and summary of diagnostic estimates of subgroup analysis

A subgroup analysis was carried out using the following criteria: (1) AS candidates were divided into two groups according to AS protocols: three studies included very low-risk PCa patients and four studies included low-risk PCa; (2) studies were divided into two groups according to year: three studies in or before 2012 and four studies after 2012; (3) studies were divided into two groups according to country: five studies from the USA and two studies from outside the USA; (4) studies were divided according to MRI technique: four in the multiparametric (MP)-MRI group and three in the non MP-MRI group, and six in the group with ERCs and one in the group without ERCs; (5) studies were divided into two groups according to biopsy strategy: four studies with targeted biopsy and three studies without targeted biopsy. A significant difference was found only between studies with and without targeted biopsy, in both sensitivity (0.80 vs 0.31, P<0.01) and specificity (0.63 vs 0.95, P=0.02).

Publication bias

Deeks’ funnel plot for publication bias23 showed no asymmetry (Figure 6). The bias coefficient was −11.66 and was not significant (P=0.30), so there was no evidence of publication bias. However, because of the limited number of included articles, it is difficult to draw a firm conclusion about whether publication bias exists in this meta-analysis.

Figure 6

Deeks’ funnel plots for publication bias.

Clinical utility of index test

From Fagan’s nomogram, we found that with a pretest probability of 20%, the post-test probability would rise to 43% after a positive MRI (PLR of 3.1) and would fall to 9% after a negative MRI (NLR of 0.40).
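The arithmetic behind the nomogram is the odds form of Bayes' theorem: post-test odds equal pretest odds multiplied by the likelihood ratio. A minimal sketch using the rounded pooled likelihood ratios from this analysis follows; the function name is hypothetical.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability to a post-test probability."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

# Pretest probability of 20% with the pooled PLR and NLR.
after_positive_mri = post_test_probability(0.20, 3.1)
after_negative_mri = post_test_probability(0.20, 0.40)
print(after_positive_mri, after_negative_mri)
```

With the rounded inputs the result is about 0.44 after a positive MRI and about 0.09 after a negative MRI; the small difference from the 43% read off the nomogram presumably reflects rounding of the likelihood ratio.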

Discussion

We have performed a meta-analysis of diagnostic accuracy of MRI on disease reclassification among AS candidates. Although meta-analysis is not yet a widely approved method to summarize evidence from diagnostic studies, we believe that pooling the diagnostic accuracy from eligible studies provides valuable information for urologists and researchers until larger studies are available.

We presented the PLR, NLR and DOR as our measures of diagnostic accuracy. A PLR>10 or an NLR<0.1 indicates high accuracy. A DOR>25 is considered moderately accurate, and a DOR>100 is considered highly accurate. A PLR of 3.1 means that a suspicious MRI is approximately 3.1 times more likely in AS candidates with disease progression than in those without. The NLR was 0.4, meaning that a negative MRI is only 0.4 times as likely in candidates with disease progression as in those without; at a pretest probability of 20%, a negative MRI lowers the probability of disease progression requiring reclassification to ∼9%. A DOR of 8, although >1, indicates relatively poor performance in PCa reclassification.

Our relatively low PLR (3.1) and high NLR (0.4), combined with modest sensitivity (0.69) and specificity (0.78), do not provide strong evidence for the diagnostic accuracy of MRI in disease reclassification. These results suggest that the accuracy of MRI in PCa reclassification may not be as high as previously described in some studies. In the study by Vargas et al.,26 the sensitivity and specificity were 0.89 and 0.70, respectively. However, the AUC of our SROC curve was 0.79, which, being >0.7, indicates moderate accuracy of MRI in disease reclassification among AS candidates.34 The relatively high specificity and AUC value suggest that, in very low-risk or low-risk PCa, a negative MRI result supports the absence of disease reclassification and continued AS.

Substantial statistical heterogeneity (I2>50%) was found in the analysis, which might confound the results. The threshold effect must be considered first in test accuracy studies; it arises when sensitivity and specificity differ because different studies use different thresholds to define a positive or negative test result.33, 35 We used the Spearman correlation coefficient to analyze the threshold effect, and its value was 0.679 (P=0.094), which indicates that there was no significant heterogeneity from threshold effects. To further evaluate the sources of heterogeneity, meta-regression analysis was conducted in terms of quality of the studies, AS protocols, year, country, MRI technique and biopsy strategy. The analysis indicated that the results may be influenced by the study by Hanna et al.29 However, after this study was removed, the results did not change substantially.

In the subgroup analysis, a significant difference emerged only when the seven included studies were divided according to biopsy strategy: sensitivity was significantly higher in the targeted biopsy category than in the non-targeted biopsy category, whereas specificity was lower, indicating that the value of MRI for PCa reclassification among AS candidates may depend on whether targeted biopsy is used. Theoretically, targeted biopsies of lesions identified on clinical or imaging findings could improve the detection of tumor burden on confirmatory biopsy. Moreover, MRI–ultrasound fusion biopsy detected clinically significant cancer in AS patients with far fewer cores while rarely missing Gleason 7 cancer.36, 37 In the included studies, only Stamatakis et al. used MRI–ultrasound-guided targeted biopsy. The AS protocol and country of the included population had no significant effect on the results of the subgroup evaluation. It should also be noted that sensitivity in the after-2012 category and specificity in the MP-MRI category were better than in the respective comparison categories, with P=0.06 and P=0.07, respectively, which approached 0.05. As 1.5 T MRI examinations were performed in or before 2012 and 3 T MRI was mostly used after 2012 in the included studies, we found no significant differences in diagnostic performance between 1.5 and 3 T MRI studies, or between MP-MRI and non MP-MRI studies. This suggests that technological improvements in MRI over the past decade may not have significantly affected diagnostic performance.

3 T MRI provides prostate images that are natural in shape and that have comparable image quality to those obtained at 1.5 T with an ERC, but not superior diagnostic performance, which suggests that an opportunity exists for improving technical aspects of 3 T prostate MRI.38 Although there is currently great interest in the use of MP-MRI sequences such as T2-weighted, echo planar, diffusion-weighted imaging and dynamic contrast-enhanced MRI, diffusion-weighted imaging and dynamic contrast-enhanced MRI have only been clinically available for a few years. Any single technique would lack the sensitivity to detect PCa before AS, since most patients harbor low-volume cancer. Further studies are needed to assess the incremental value of MP-MRI.

The most important finding of our study is that MRI may reveal an unrecognized significant lesion in 33.27% of patients, and biopsy of these areas reclassified 14.59% of cases as no longer fulfilling the criteria for AS. In addition, when no suspicious disease progression (66.34%) was identified on MRI, the chance of reclassification on repeat biopsy was extremely low at 6.13%.

Several earlier studies examined MRI in AS candidates with low-risk PCa. Cabrera et al.39 found no association among clinical stage, Gleason, PSA and apparent tumor on endorectal MRI. In another retrospective analysis, van As et al.40 pointed out that low apparent diffusion coefficient was associated with adverse histology on repeat biopsy. Tumor apparent diffusion coefficient correlated with maximum core involvement, the percent of positive cores, initial PSA and the free-to-total PSA ratio.

Several limitations should be considered when drawing conclusions. First, as we searched only PubMed, studies in other databases and unpublished data may not be included in our meta-analysis. Second, the sample sizes of some selected studies are small, which can lead to overestimation of the diagnostic accuracy, and the subgroup analysis is limited by the restricted original data. Third, variability in imaging techniques, MRI criteria, reader experience and biopsy techniques are possible confounders. For this reason, bivariate mixed-effects binary regression modeling was used, as it provides more conservative estimates. Last but not least, although the funnel plot does not indicate publication bias in this meta-analysis, the number of studies is small, and the influence of publication bias cannot be completely excluded.

In summary, this study presents a cumulative analysis of 1028 patients from seven studies and demonstrates the value of MRI as a predictor of PCa reclassification among AS candidates. MRI shows a very high NPV for the intermediate end point of disease upstaging. Therefore, favorable findings on high-quality MRI, that is, MP-MRI, may be used for selection and follow-up of AS candidates and might reduce the need for repeat biopsies. However, the PPV of MRI for higher-risk disease seems to be considerably lower in low-risk PCa patients under AS, which may be caused by reporter bias, with more false positives in AS cohorts who are known to have PCa. The reported range of percentages of disease progression identified in AS candidates is very wide (19–93%). Together, these findings indicate that a negative MRI in an AS candidate supports keeping him on AS, and that an MR index lesion, especially one greater than 10 mm, has the potential to become a marker for patients who require increased attention, allowing targeted biopsies of index lesions.

Conclusion

MRI, especially MP-MRI, has moderate diagnostic accuracy as a predictor of disease reclassification among AS candidates. MRI of the prostate was found to have a high NPV and specificity for the prediction of biopsy reclassification upon clinical follow-up, which suggests that negative prostate MRI findings may support a patient remaining under AS. Although the PPV and sensitivity for the prediction were relatively low, the presence of a suspicious lesion >10 mm may suggest an increased risk for disease progression.