Introduction

Histopathological tumor grading reflects the degree of differentiation of a given tumor, and for most solid tumors, including urological tumors, grading is one of the most important prognostic markers. Accordingly, in the working classification of prognostic factors introduced by the College of American Pathologists [1], grading is classified for many tumors as a category I prognostic factor, meaning that its prognostic value is well supported by the literature and that it is generally used in patient management. Category II factors, by contrast, have been studied extensively biologically and/or clinically but have not been conclusively proven to be of value in multivariate analyses. The remaining category III applies to factors that show some promise but do not meet the criteria of categories I or II.

Given its strong prognostic impact in predicting the biological aggressiveness of malignant tumors, the tumor grade provided by pathologists strongly influences the clinical management of tumor patients. Consequently, an ideal grading system should meet at least two major requirements at the same time: it should be of high prognostic relevance and of high reproducibility among different pathologists. To this end, several histopathological grading systems have been developed, both for different tumor entities and for individual tumor types. All of these grading systems share an inherent degree of subjectivity, resulting in both intra- and interobserver variability. The reproducibility of these grading systems among pathologists has been analysed in several studies. The comparability of these studies, however, is limited, for example because different statistical methods have been used. Whereas some studies only determined the percentage of agreement, others used the more elaborate kappa (κ) and weighted kappa (κw) analyses. Both κ and κw are very useful measures of interobserver agreement, as the level of agreement is adjusted for that expected by chance [2–4]. When the observed agreement exceeds chance agreement, κ is positive, with its magnitude reflecting the strength of agreement. Thus, κ 0.00–0.20 reflects slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.00 almost perfect agreement [2–4]. Unlike κ, κw uses weights to quantify the relative difference between categories, so that close disagreement is not weighted as heavily as more serious disagreement.
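As an illustration of how these measures are computed, the following minimal Python sketch derives κ and a linearly weighted κw for two observers and maps the result onto the agreement categories listed above. The function names (cohen_kappa, agreement_label), the choice of linear weights, and the example grades are assumptions made for this illustration only and are not taken from the studies reviewed here. The label returned for negative values is likewise an addition, since the categories above start at 0.00.

```python
def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's kappa for two raters; linear weights if weighted=True.

    `categories` must list all possible grades in ascending order.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Cross-tabulate the paired ratings (rows: rater A, columns: rater B).
    table = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        table[index[a]][index[b]] += 1
    n = len(rater_a)
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal; off the diagonal either
        # 1 (unweighted) or proportional to the distance between grades.
        if i == j:
            return 0.0
        return abs(i - j) / (k - 1) if weighted else 1.0

    # Observed and chance-expected (weighted) disagreement.
    observed = sum(weight(i, j) * table[i][j]
                   for i in range(k) for j in range(k)) / n
    expected = sum(weight(i, j) * row[i] * col[j]
                   for i in range(k) for j in range(k)) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - observed / expected


def agreement_label(kappa):
    """Map a kappa value onto the agreement categories quoted in the text."""
    if kappa < 0:
        return "less than chance"  # assumption: not part of the quoted scale
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Invented three-tiered grades assigned by two observers to ten tumors.
observer_1 = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
observer_2 = [1, 1, 2, 2, 2, 3, 3, 3, 3, 1]
kappa = cohen_kappa(observer_1, observer_2, [1, 2, 3])
kappa_w = cohen_kappa(observer_1, observer_2, [1, 2, 3], weighted=True)
print(f"kappa = {kappa:.2f} ({agreement_label(kappa)})")
print(f"weighted kappa = {kappa_w:.2f} ({agreement_label(kappa_w)})")
```

In this invented example the observers agree on 7 of 10 tumors, but because 34% agreement would be expected by chance alone, κ is only about 0.55. Two of the three disagreements concern neighbouring grades and therefore count only half in the weighted statistic, which is why κw comes out slightly higher.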

In the present review, grading systems for the most frequent urological tumors (i.e. prostate cancer, renal cell carcinoma, and urothelial tumors) are outlined, and data on the reproducibility and reliability of the most commonly used grading systems are summarized.

Prostate cancer

In order to predict the clinical behaviour and aggressiveness of prostate cancer, several grading systems with proven prognostic relevance have been developed by pathologists in the past decades [5–10]. Among these, the Gleason grading system has become universally acknowledged and is the most commonly used; consequently, it was also included in the new World Health Organization (WHO) classification [11]. Since its first description, the Gleason grading system has been slightly modified in the mid-1960s [5, 12], in the mid-1970s [13, 14], and most recently in 2005 by the International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostate carcinoma [15].

To be of clinical relevance, a grading system must display sufficient reproducibility, and grading done on biopsies should be reasonably representative of the tumor as a whole [7, 16]. The correlation between biopsy and prostatectomy Gleason scores has been investigated in several studies [17–20], and exact concordance was found in only 28–68% of cases (pooled data, 44.5%). This relatively poor concordance is mainly caused by undergrading of low-grade carcinomas in biopsy specimens, whereas higher agreement is achieved in grading of high-grade carcinomas [16, 21, 22]. More specifically, biopsies were found to be undergraded in 24–60% (pooled data, 45%) and overgraded in 5–32% (pooled data, 10.4%) [17–20].
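The three rates quoted above (exact concordance, undergrading, and overgrading) follow from a simple case-by-case comparison of paired scores. The short Python sketch below shows this arithmetic; the function name biopsy_concordance and the ten score pairs are hypothetical and serve only to illustrate the calculation, not to reproduce data from the cited studies.

```python
def biopsy_concordance(biopsy_scores, prostatectomy_scores):
    """Fractions of cases with exact agreement, undergrading and overgrading."""
    if len(biopsy_scores) != len(prostatectomy_scores) or not biopsy_scores:
        raise ValueError("need two equally long, non-empty score lists")
    n = len(biopsy_scores)
    pairs = list(zip(biopsy_scores, prostatectomy_scores))
    return {
        "exact": sum(b == p for b, p in pairs) / n,
        # Undergrading: the biopsy score is lower than the prostatectomy score.
        "undergraded": sum(b < p for b, p in pairs) / n,
        # Overgrading: the biopsy score is higher than the prostatectomy score.
        "overgraded": sum(b > p for b, p in pairs) / n,
    }


# Invented Gleason scores for ten patients (biopsy vs. prostatectomy).
biopsy = [6, 6, 6, 7, 7, 8, 5, 6, 7, 9]
prostatectomy = [6, 7, 7, 7, 7, 7, 6, 6, 8, 9]
rates = biopsy_concordance(biopsy, prostatectomy)
for name, value in rates.items():
    print(f"{name}: {value:.0%}")
```

Because every case falls into exactly one of the three groups, the three fractions always sum to 1, which is a useful consistency check when comparing figures across studies.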

To improve this concordance, the Gleason grading system and its practical application by pathologists have recently been subjected to several modifications by an ISUP consensus conference [15]. The most important of these modifications concern the definitions of Gleason patterns 3 and 4. Thus, “individual cells” are no longer allowed within Gleason pattern 3, and most cribriform patterns are diagnosed as Gleason pattern 4, with only rare cribriform lesions satisfying the diagnostic criteria for cribriform pattern 3 (i.e. rounded, well-circumscribed glands of the same size as normal glands). Ill-defined glands with poorly formed glandular lumina warrant the diagnosis of Gleason pattern 4, whereas very small well-formed glands are still within the spectrum of Gleason pattern 3. In addition, it was recommended that on needle biopsies with more than two different Gleason patterns, both the primary pattern (i.e. the most prevalent pattern) and the highest grade should be recorded. Furthermore, the diagnosis of Gleason score 4 on needle biopsies should only be made “rarely, if ever”. As a more detailed description of all modifications introduced by the ISUP consensus conference is beyond the scope of this review, the reader is referred to the respective literature for further information [15, 23]. Interestingly, a recent study showed that these modifications resulted in a significant improvement of concordance between biopsy and prostatectomy Gleason scores from 58 to 72% when compared to conventional Gleason grading [16]. It is still unclear whether the overall concordance between biopsy and prostatectomy Gleason scores can also be significantly increased by using extended biopsy schemes. While an earlier study suggested that the prediction of the prostatectomy Gleason score is only marginally improved by increasing the number of biopsies [24], it was recently shown that the overall concordance between biopsy and prostatectomy Gleason scores can be significantly increased from 48 to 68% by using an extended biopsy scheme (mean 12.4 biopsy cores) rather than a traditional sextant scheme [25].

In addition to being representative of the tumor as a whole, a clinically useful grading system must be sufficiently reproducible. The Gleason grading system, like all histological grading methods, possesses an inherent degree of subjectivity. Consequently, both intra- and interobserver variability exist. Intraobserver agreement of Gleason grades was reported in 43–78% of cases, and agreement within ±1 Gleason score unit was found in 72–87% of cases [26–28]. This is in line with the intraobserver agreement of Dr. Gleason himself, who wrote that on re-examining routine clinical material (not prototypical examples), including approximately 50% needle biopsies, he duplicated his previous histological scores exactly approximately 50% of the time and within ±1 histologic score (range 2–10) approximately 85% of the time [29]. In contrast to the limited number of studies on intraobserver agreement of Gleason grading, several studies have addressed interobserver reproducibility (Table 1). The comparability of these studies, however, is limited, as the study designs vary considerably in terms of the definition of agreement (e.g. exact agreement, Gleason score ±1, or Gleason categories), the statistical analysis of agreement (e.g. percentage values, κ values, or weighted κ values), the type of specimens investigated (e.g. biopsies, radical prostatectomy specimens, transurethral resection specimens, a mixture of these, or tissue microarray spots), the number of specimens investigated, the number of pathologists involved, and the qualification/specialisation of the pathologists involved (e.g. general pathologists, genitourinary pathologists, or expert genitourinary pathologists). Overall, the interobserver agreement in these studies was mostly moderate, with κ values around 0.5–0.6 (Table 1). Given the high clinical relevance of Gleason grading for making treatment decisions, this level of agreement is unsatisfactory. Apparently, the main reason for this situation is insufficient experience and familiarity of pathologists with this grading system. This is suggested by the fact that genitourinary pathologists or pathologists with special interests in genitourinary pathology seem to achieve higher agreement levels, with κ values of 0.6–0.8, than general pathologists [30–32]. In line with this, Allsbrook et al. found that pathologists who had learned Gleason grading at a meeting or a course achieved higher agreement than pathologists who had not [31]. Overall, the major interobserver reproducibility problem is undergrading. In particular, Gleason scores 5–6 are often underdiagnosed as Gleason scores 2–4, and cribriform sheets and fragments of Gleason pattern 4 are often mistaken for Gleason pattern 3 [30–32]. To overcome this problem, different teaching methods have been developed, and it was shown that in this way the level of agreement in Gleason grading could be markedly improved from moderate to substantial, with κ values ranging from 0.68 to 0.78 [33–35]. Moreover, a recent ISUP consensus conference developed new standards both in the definition of pattern characteristics and in the application of the Gleason grading system in general [15]. Application of these recommendations has already led to increased concordance between biopsy and prostatectomy Gleason scores when compared to conventional Gleason grading [16]. So far, however, only one study has investigated the effects of these recommendations on interobserver agreement in Gleason grading. According to this study, interobserver agreement of modified Gleason grading, as measured by κw statistics in a cohort of 69 consecutive prostatectomy specimens, is at least as high as that of conventional Gleason grading, with mean κw being 0.58 and 0.56, respectively [36]. However, the effect of modified Gleason grading on intraobserver agreement still remains to be determined.

Table 1 Literature review on interobserver agreement of Gleason grading

Renal cell carcinoma

Grading of renal cell carcinoma (RCC) was first reported in 1932 by Hand and Broders [37], who described that patients with high-grade carcinoma were at higher risk of dying than patients with low-grade tumors. Since then, numerous studies have established the prognostic value of RCC grading, and different grading systems have emerged from these studies (for review see [38]). Of these, nuclear grading systems seem to be more predictive of disease-specific survival than grading systems based on cytoplasmic and/or architectural features of tumor cells [39–43]. As each system has its own advantages and disadvantages [38], there is no consensus yet as to which grading system should be used [42]. Instead, an International Consensus Conference stated in 1997 that an ideal grading system, preferably a three-grade system, still needs to be established [42]. At present, however, the four-tiered Fuhrman grading system is the most commonly used system in Europe and North America.

Studies on the reproducibility of RCC grading are limited, and most of them refer to the Fuhrman grading system (Table 2). The first study addressing this topic was published by Lanigan et al. in 1994 [44]. In this study the authors compared the reproducibility of four different grading systems (the Arner, Skinner, Syrjanen-Hjelt and Fuhrman systems) among four different pathologists [44]. Using the κ statistics of Landis and Koch [3], which correct for chance agreement, as a measure of interobserver agreement, the authors found agreement to be fair to moderate, with κ values of 0.42 (Syrjanen-Hjelt system), 0.33 (Fuhrman system), 0.26 (Skinner system) and 0.24 (Arner system), respectively. In a more recent study on 2042 RCCs, original nuclear grades as assigned at initial pathological diagnosis were compared to standardized nuclear grades reassigned after slide review, and results were stratified by histological subtype of RCC (i.e. clear cell, papillary and chromophobe cell type) [45]. For clear cell RCC, nuclear grade remained unchanged on review for 56.32% of the tumors, while 35.26% were upshifted by one or more grades and 8.42% were downshifted. For papillary RCC, nuclear grade remained unchanged for 49.1% of the tumors, whereas 44.1% were upshifted by one or more grades and 6.8% were downshifted. Similarly, for chromophobe cell RCC, nuclear grade was unchanged in 55%, upshifted in 38% and downshifted in 7% of the tumors. κ values were not determined in this study. Of note, for all histological subtypes the reviewed grades were more predictive of death due to RCC than the respective original grades, and this also held true after adjusting for the 1997 TNM stage. Recently, however, the relevance of Fuhrman grading in chromophobe cell RCC was questioned, as in a cohort of 87 cases Fuhrman grading failed to show any significant association with the patients’ outcome [46].

Table 2 Literature review on interobserver agreement of grading systems of renal cell carcinoma

In another study, original Fuhrman grades of 388 clear cell RCCs were compared to Fuhrman grades reassigned by a single pathologist after slide review [47]. On review, tumors originally classified as G1 were upshifted by 1 grade in 38.7% of the cases, by 2 grades in 18.9% of the cases and by 3 grades in 2.7% of the cases. Tumors originally classified as G2 were upshifted by 1 grade in 34% of the cases and by 2 grades in 4.3% of the cases. Grading of tumors originally classified as G3 and G4 remained unchanged in 73.1 and 89.3%, respectively. Overall, interobserver concordance in this study was moderate, as indicated by a κ value of 0.44.

In a retrospective multicenter study, Lang et al. [48] investigated interobserver agreement of three pathologists in grading 241 RCCs according to the Fuhrman system. Using the original four-tiered Fuhrman grading system, a concordance rate among the three pathologists of only 24% was observed. The corresponding mean interobserver κ value was 0.22 (range 0.09–0.36), indicating fair agreement according to the commonly used interpretation of κ values proposed by Landis and Koch [3]. The level of concordance, however, could be improved by collapsing the original four-tiered grading system to three-tiered and two-tiered grading systems, respectively. The highest mean κ value was obtained with a two-tiered scheme in which low-grade (grade 1–2) tumors were distinguished from high-grade (grade 3–4) tumors. In this way, agreement among all three pathologists occurred in 58.9% of the cases and a mean κ value of 0.44 (range 0.32–0.55) was achieved. Most importantly, the original four-tiered Fuhrman grade was an independent prognostic factor for all three pathologists, and nuclear grade continued to have independent prognostic value after the optimal collapsing algorithm was performed.

Results similar to those reported by Lang et al. [48] were obtained in a study by Al-Aynati et al. [49]. In a cohort of 99 RCCs, interobserver variability in four-tiered Fuhrman grading was determined among four pathologists and a mean κ value of 0.29 was observed. By combining Fuhrman grades 1 and 2 as low-grade tumors and grades 3 and 4 as high-grade tumors, interobserver agreement could be improved, as indicated by a mean κ value of 0.45. In addition to their analysis of interobserver variability, the authors also addressed intraobserver variability in the same study. To this end, all four pathologists reassigned Fuhrman grades to all cases after a period of 3–5 months. Using the four-tiered Fuhrman grading system, intraobserver κ values ranged from 0.29 to 0.62 (mean = 0.45), indicating a moderate level of concordance according to Landis and Koch [3]. When the diagnostic grades were collapsed to two categories (low-grade vs. high-grade tumors), intraobserver κ values ranged from 0.40 to 0.64 (mean = 0.53), reflecting a slight improvement within the range of moderate agreement.
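Why collapsing a four-tiered scale into low-grade (1–2) versus high-grade (3–4) can raise κ is easiest to see on a small example. The Python sketch below is a minimal illustration under stated assumptions: the grades are invented, only two observers are compared (whereas the studies above report mean κ values across three or four pathologists), and the names cohen_kappa and COLLAPSE are hypothetical helpers introduced for this illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters."""
    n = len(rater_a)
    # Proportion of cases on which both raters assign the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Two-tiered scheme: grades 1-2 become "low", grades 3-4 become "high".
COLLAPSE = {1: "low", 2: "low", 3: "high", 4: "high"}

# Invented four-tiered grades assigned by two observers to twelve tumors.
observer_1 = [1, 2, 2, 3, 3, 4, 1, 2, 3, 4, 2, 3]
observer_2 = [2, 2, 1, 3, 4, 4, 1, 3, 3, 3, 2, 2]

kappa_four_tier = cohen_kappa(observer_1, observer_2)
kappa_two_tier = cohen_kappa([COLLAPSE[g] for g in observer_1],
                             [COLLAPSE[g] for g in observer_2])
print(f"four-tiered kappa: {kappa_four_tier:.2f}")
print(f"two-tiered kappa:  {kappa_two_tier:.2f}")
```

In this invented data set, four of the six disagreements are between grades 1 and 2 or between grades 3 and 4 and therefore disappear after collapsing, so κ rises from about 0.31 to about 0.67. Disagreements across the grade 2/3 boundary remain, which is one reason why collapsing improves but does not resolve interobserver variability.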

Overall, interobserver agreement in nuclear grading of RCCs appears to be only fair to moderate, independent of the grading system used. This could at least in part be due to the heterogeneity of RCCs, as areas of different grades are often found within a given tumor [38, 49]. Nuclear grading should be based on the highest-grade area identified within a tumor. However, the minimum size required for such an area to be considered significant has not yet been standardized [38, 42, 49].

Urothelial neoplasms of the bladder

In the past decades, several classification and grading systems for urothelial neoplasms of the urinary bladder have been proposed, including the 1973 WHO system [50], the Bergkvist system [51], the Murphy system [52], and the Pauwels system [53]. Of these, the 1973 WHO classification and grading system has been the most commonly used and remained unchanged for about 30 years. For tumors diagnosed as carcinomas, histologic grading was based on the degree of cellular anaplasia using a three-tiered scale: grade 1 (G1) was characterized by the least degree of anaplasia compatible with malignancy, grade 3 (G3) by the most severe degree of anaplasia, and grade 2 (G2) by an intermediate degree of anaplasia. The main problem with this grading system was the lack of defined cut-off points among the three grades of differentiation, giving rise to high intra- and interobserver variability (Table 3) [54–56]. This is reflected by the fact that the reported frequency of G2 tumors in non-selected tumor cohorts ranged from 13 to 69%, the frequency of G1 tumors from 8 to 25%, and the frequency of G3 tumors from 23 to 63% [53, 57, 58]. Ooms et al. [59] investigated interobserver variability in grading of bladder cancer among six pathologists in a setting of 57 cases. The Spearman rank-order correlation coefficients were 0.5–0.67 for intra- and 0.46–0.58 for interobserver variability, and the authors interpreted these results as “disturbingly high” variability, which might invalidate the usefulness of grading in clinical decision making. Tosoni et al. [60] found interobserver discrepancies in grading in 39% of 301 cases of pTa and pT1 bladder cancers. In a study by Robertson et al. [61], interobserver agreement among 11 pathologists was slight to moderate, as reflected by κ values ranging from 0.19 to 0.44.

Table 3 Literature review on interobserver agreement of WHO grading systems of urothelial neoplasms

Given the strong prognostic impact of tumor grading, this high variability raised concerns about the appropriateness of clinical management strategies when the tumor grade itself is uncertain. Consequently, a new classification and grading system, subsequently known as the 1998 WHO/ISUP system, was proposed in 1998 [54] and was adopted in 2004 in the most recent WHO classification and grading system (Pathology and genetics: tumours of the urinary system and male genital organs, one of a series of WHO “Blue Books”). The most important changes in comparison to the 1973 WHO system were (1) the introduction of a new category of non-invasive papillary urothelial tumors, referred to as papillary urothelial neoplasms of low malignant potential (PUNLMP), (2) a detailed histological description of the different categories of non-invasive papillary urothelial tumors and (3) the collapsing of the formerly three-tiered grading system (G1, G2, G3) to a two-tiered system (low-grade vs. high-grade) for both non-invasive papillary carcinomas and invasive carcinomas in general. A comparison of the 1973 and 2004 WHO grading systems is shown in Fig. 1.

Fig. 1 Comparison of the 1973 and 2004 WHO grading systems for non-invasive papillary urothelial neoplasms and invasive urothelial carcinomas in general. The 1973 WHO grade 1 non-invasive papillary urothelial carcinomas are reassigned, some to the papillary urothelial neoplasm of low malignant potential (PUNLMP) category and some to the low-grade non-invasive papillary urothelial carcinoma category. Grade 2 non-invasive papillary urothelial carcinomas (1973 WHO) are reassigned, some to the low-grade and the remainder to the high-grade non-invasive papillary urothelial carcinoma category. All grade 3 non-invasive papillary urothelial carcinomas (1973 WHO) are reassigned to the high-grade non-invasive papillary urothelial carcinoma category. WHO: World Health Organization; PUNLMP: papillary urothelial neoplasm of low malignant potential

An important goal of the 2004 WHO classification was to improve reproducibility in diagnosis and grading among different pathologists by providing detailed histological criteria for each diagnostic category. To date, however, a significant improvement in intra- and interobserver variability as compared to the 1973 WHO system has not been found (Table 3). Among other things, this is reflected by the fact that in five different studies in which non-invasive papillary urothelial tumors were graded according to the 1998 WHO/ISUP (=2004 WHO) system, the incidence of PUNLMP varied from 12 to 39%, that of low-grade carcinoma from 27 to 63%, and that of high-grade carcinoma from 21 to 67% [62–66]. More specifically, Murphy et al. [67] found only slight to moderate interobserver agreement for PUNLMP and low-grade carcinomas among three pathologists (κ = 0.12–0.50), compared to substantial agreement for high-grade carcinomas and carcinoma in situ (κ = 0.75–0.82). In a study by Campbell et al. [68], interobserver agreement of the 1998 WHO/ISUP system was found to be moderate (κ = 0.45), and the level of agreement could not be significantly increased even when the pathologists reviewed the cases together and reached a consensus diagnosis (κ = 0.60). Yorukoglu et al. [69] investigated intra- and interobserver agreement of both the 2004 WHO and the 1973 WHO system in a setting of 30 cases of non-invasive papillary urothelial tumors and six pathologists, after having provided a teaching set to each study participant. No significant differences became evident either in intraobserver reproducibility (2004 WHO: κ = 0.67, range 0.45–0.89 vs. 1973 WHO: κ = 0.66, range 0.45–0.89) or in interobserver reproducibility (2004 WHO: κ = 0.56, range 0.42–0.65 vs. 1973 WHO: κ = 0.48, range 0.19–0.65). In a recent study by Gönül et al. [56], two pathologists assigned a tumor grade according to the 1973 WHO and the 1998 WHO/ISUP (=2004 WHO) system to 258 consecutive papillary urothelial carcinomas. Regardless of the pathologist, tumor grades of the two grading systems correlated with each other and with the pathological stage. The overall agreement between pathologists was somewhat higher in the 1998 WHO/ISUP (=2004 WHO) system (κ = 0.59) than in the 1973 WHO system (κ = 0.41), but both κ values were still within the range of moderate agreement. Thus, in summary, the studies performed so far suggest that the reproducibility of the 1998 WHO/ISUP (=2004 WHO) system does not appear to be appreciably different from that of the 1973 WHO classification.

Interestingly, however, in the study by Gönül et al. [56] the level of interobserver agreement of the 1998 WHO/ISUP (=2004 WHO) system differed considerably between tumor categories. While the highest level of grading agreement was found in pT1 carcinomas (κ = 0.91), the lowest level of agreement was observed when only tumors of the PUNLMP and the low-grade non-invasive papillary carcinoma categories were included in the analysis (κ = 0.26). Similarly, Murphy et al. [67] reported a 50% discrepancy rate among pathologists attempting to distinguish between PUNLMPs and low-grade papillary urothelial carcinomas, even after a period of structured education. Accordingly, in the study by Yorukoglu et al. [69], mean rates of agreement for PUNLMP, low-grade non-invasive papillary urothelial carcinoma, and high-grade non-invasive papillary urothelial carcinoma were 48, 72.7, and 92%, respectively. Apparently, the still unsatisfactory overall levels of interobserver agreement in grading of non-invasive papillary urothelial carcinomas according to the 2004 WHO system can largely be attributed to the fact that the histologic distinction between PUNLMP and low-grade non-invasive papillary urothelial carcinoma causes major difficulties, even for experienced pathologists and even though detailed histological criteria for these categories have been provided.

This raises the question as to whether a distinction between PUNLMP and low-grade non-invasive papillary urothelial carcinoma is of any prognostic and clinical use, because only such relevance would justify retaining this classification. Intriguingly, studies on the prognostic and clinical relevance of the 1998 WHO/ISUP (=2004 WHO) system were performed only after its publication. In these studies, PUNLMPs have been reported to recur in up to 60% and to progress to invasive carcinoma in up to 8% of the cases, whereas low-grade non-invasive papillary urothelial carcinomas recurred in up to 77% and progressed in up to 13% of the cases [55, 62, 68, 70–74]. Overall, the differences in aggressiveness, and hence prognosis, between PUNLMPs and low-grade non-invasive papillary urothelial carcinomas reported so far seem to be slight rather than pronounced. Consequently, the clinical management of patients with PUNLMP or low-grade non-invasive papillary urothelial carcinomas is currently similar if not identical [55]. From this one might conclude that no differentiation between these categories is needed. However, given the strong interobserver variability among pathologists in distinguishing between these two categories and given the knowledge about the biological heterogeneity of low-grade non-invasive papillary urothelial tumors (including PUNLMPs), it might be promising, before abandoning this classification, to search for additional (e.g. molecular) markers which, together with the established histological criteria, allow a more precise distinction between prognostically and hence clinically relevant subgroups.

Another aspect contributing to interobserver variability in grading of urothelial tumors is the well-known heterogeneity of these tumors. Different grades are often found within a given tumor, and in general the overall grade is based on the highest-grade area identified within a tumor. However, as for renal cell carcinoma, the minimum size required for such an area to be considered significant has not yet been standardized. Consequently, some pathologists assign a high tumor grade whenever a high-grade area is present. In contrast, other pathologists assign a high tumor grade only when the high-grade area comprises more than 5% of an otherwise low-grade tumor.

Conclusions

Like other tumors, urological tumors are known to be both biologically and morphologically heterogeneous. Consequently, histological grading systems possess an inherent degree of subjectivity, giving rise to both intra- and interobserver variability. In general, reproducibility levels of the most commonly used grading systems of urological tumors are fair to moderate, and grading of low-grade tumors poses more difficulties for pathologists than grading of high-grade tumors. Nevertheless, for most urological tumors it is well established that grading is an important factor in predicting their biological aggressiveness.

With regard to prostate cancer, structured education was shown to significantly improve reproducibility in Gleason grading, and consequently several teaching facilities have recently been established. Fuhrman grading of renal cell carcinomas is only fairly to moderately reproducible, and collapsing the original four-tiered grading system to a two-tiered grading system seems to improve the reproducibility only marginally. Studies addressing the value of structured teaching in Fuhrman grading have not been reported yet. A more precise definition of how to grade heterogeneous tumors, with special emphasis on the minimal amount of high-grade areas required to upgrade an otherwise low-grade tumor, will help to improve grading reproducibility, but most likely only to a limited extent. Therefore, it appears that a new grading system for renal cell carcinoma, which possibly also includes molecular markers, needs to be established. Grading reproducibility of urothelial tumors using the 2004 WHO system appears to be largely hampered by the difficulty of distinguishing between PUNLMP and low-grade non-invasive papillary urothelial carcinoma, and studies suggest that this difficulty cannot be overcome by structured teaching. Apparently, the distinction between PUNLMP and low-grade non-invasive papillary urothelial carcinoma cannot reliably be made based on the histological criteria established so far and instead requires the identification of specific (e.g. molecular) markers. As long as no such markers are available and the prognostic and hence clinical relevance of the distinction between PUNLMP and low-grade non-invasive papillary urothelial carcinoma has not been established, the new terminology used in the 2004 WHO classification is of questionable validity and utility.