Introduction

Renal transplantation is considered the best therapeutic option for patients with end-stage renal disease, but demand for donor organs still exceeds availability. To increase the deceased donor pool, expanded criteria donors (ECD), comprising donors aged > 60 years or aged 50–59 years with either hypertension, serum creatinine (SCr) > 1.5 mg/dL or death from a cerebrovascular accident, are considered [1], with a pre-implantation biopsy often undertaken to assess the suitability of the kidneys. There are conflicting conclusions as to the value of a biopsy and which histological parameters are the best indicators of organ outcome [2]. Comparison is difficult because scoring systems, type of biopsy (wedge vs needle core), and outcome indicators differ; in addition, the pathologist’s expertise, which is known to influence the correlation with outcome, is not specified or taken into account. The majority of studies consider percent glomerulosclerosis (GS) > 20% in a wedge biopsy of renal cortex as the cut-off between acceptance and discard [3]. Wedge biopsies are often preferred due to lower rates of complications [4]. However, they are often unreliable for assessing arterial fibro-intimal thickening/arteriosclerosis [5] and overestimate GS and interstitial fibrosis (IF), particularly if an area of subcortical scarring is sampled [6]. Sampling error of pre-implantation biopsies can be further assessed by comparing the biopsy with a larger representative sample in the discarded whole kidney [6,7,8,9]. The majority of studies base conclusions about the reliability of pre-implantation biopsies on the outcome of the transplanted kidney. There is no agreed “best” grading system for pre-implantation biopsies [2]: several systems are based on the addition of the Banff grades, such as the “Donor score” [10], or on the Pirani system, such as Karpinski/Remuzzi [11, 12].
The Maryland aggregate pathology index (MAPI) [13] was developed based on statistical associations with long-term kidney outcome and included morphometry, similarly to the “CIV score” [14] and the “Donor chronic damage score” [15]. Composite scoring systems that add clinical factors have also been developed [16, 17], similarly to scoring systems for potential oncological recipients [18]. However, irrespective of the scoring system, it is important to remember that the biopsy is assumed to represent the state of the kidney. In most studies, data are derived from on-call pathologists with little or no specialist training in renal pathology; under these circumstances, outcome data have little association with histology, in contrast to scoring by specialist renal pathologists [10, 19]. It is also important to remember that using an adequate kidney, even if not of optimal function, still offers a survival advantage or no excess mortality for an older recipient [20]. In Italy, pre-implantation biopsies are widely performed on ECD kidneys and reported by on-call pathologists with no expertise in renal pathology, with an old-for-old matching policy, resulting in an increase in utilization of these donors.

The aim of this study is to compare agreement between general and specialist pathologists in the assessment of Remuzzi score and of extra Banff variables on pre-implantation biopsies, and to compare the scores obtained by biopsy with scores on the discarded whole organ to assess biopsy sampling error.

Methods

Case selection and processing

Forty-six discarded kidneys from 36 donors were retrieved from the archive of the Hospital Trust of Verona, in the period January 2013–October 2018. As biopsies were sometimes repeated at the implanting centre if the original biopsy had been performed elsewhere, there were 75 pre-implantation biopsies taken from these 46 kidneys. The distribution of donor biopsies and discarded kidneys is shown in Table 1. Fifty-one (68%) of the biopsies were rapidly formalin-fixed and paraffin-embedded (FFPE) using microwave technology and 24 (32%) were frozen sections (FS). There were 62 (82.7%) wedge biopsies and 13 (17.3%) needle core biopsies. All the biopsies were performed in Verona and were available for review. The original reports by the 12 on-call pathologists were acquired. Discarded kidneys were sampled according to the Royal College of Pathologists of Australasia (RCPA) guidelines for non-tumor kidney specimens [21] with a minimum of three blocks of representative parenchyma. Between 3 and 17 hematoxylin–eosin (H&E) sections at 10 µm cutting level for each biopsy and between 3 and 15 sections of organs were examined. No special stains were performed. Biopsies were examined by three general and two specialist pathologists whilst all sections of the discarded kidney were examined by a single specialist pathologist.

Table 1 Distribution of biopsy and discarded organs related to donors

Histological scoring

The original on-call pathologists’ Remuzzi grades for GS, tubular atrophy (TA), IF and vascular narrowing/arteriosclerosis (AS) were obtained from the original reports, which gave no detailed estimate of percentages. A biopsy was reported as adequate when at least 25 evaluable glomeruli and 2 arteries were present. All biopsies, irrespective of adequacy, were assessed and graded for agreement among pathologists, but inadequate biopsies (n = 1) were excluded from the comparison with organs. Following Remuzzi et al. [11], integrated with the indications of the Nord Italian Transplant program, at the Hospital Trust of Verona kidneys are usually considered suitable for single transplantation with a score of 0–4, suitable for double transplantation with a score of 5–6, and discarded with a score ≥ 7. These indications are not strictly mandatory, and in some situations the surgeon may exercise case-by-case judgment.
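The adequacy rule and the score bands described above can be sketched in a few lines of Python. This is an illustrative sketch only, not software used in the study: the function names are hypothetical, each of the four Remuzzi grades is assumed to be an integer from 0 to 3 (so the total lies between 0 and 12), and the bands are, as noted above, not strictly mandatory in practice.

```python
def is_adequate(glomeruli: int, arteries: int) -> bool:
    """Adequacy rule from the text: at least 25 evaluable glomeruli and 2 arteries."""
    return glomeruli >= 25 and arteries >= 2


def remuzzi_total(gs: int, ta: int, if_: int, as_: int) -> int:
    """Sum the four grades (assumption: each grade is an integer from 0 to 3)."""
    grades = (gs, ta, if_, as_)
    if any(g not in (0, 1, 2, 3) for g in grades):
        raise ValueError("each grade is assumed to be 0, 1, 2 or 3")
    return sum(grades)


def allocation(total: int) -> str:
    """Map a total score to the usual allocation band described in the text."""
    if not 0 <= total <= 12:
        raise ValueError("total score is assumed to lie between 0 and 12")
    if total <= 4:
        return "single"
    if total <= 6:
        return "double"
    return "discard"


# Hypothetical biopsy: GS 1, TA 1, IF 2, AS 1 gives a total of 5.
print(allocation(remuzzi_total(1, 1, 2, 1)))  # prints "double"
```

The mapping returns only the usual band; the case-by-case judgment of the surgeon is deliberately left outside the function.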

Each biopsy was reviewed by three general pathologists, not involved in the original reports, and by two pathologists with expertise in renal pathology, each of whom assigned a Remuzzi score blinded to the original report. In addition, all pathologists scored the biopsies for additional parameters in the Banff pre-implantation consensus document [22]. The scoring scheme for both Remuzzi grades and additional Banff parameters is shown in Table 2.

Table 2 Scoring scheme for Remuzzi grades and additional Banff parameters

Discarded kidneys

A single specialist pathologist reviewed all the sections from each kidney, blinded to the results of the corresponding biopsies and assigned the scores based on the worst grade encountered across all slides. For each slide 100 glomeruli were assessed to determine GS and the mean of all counts was taken as the final value.
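The aggregation across slides described above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the procedure as implemented by the authors: the dictionary keys and function name are hypothetical, GS is taken as the mean of the per-slide counts, and the worst-grade rule is assumed to apply to TA, IF and AS.

```python
from statistics import mean


def score_discarded_kidney(slides):
    """Combine per-slide readings into one organ-level result.

    `slides` is a list of dicts, one per slide, with hypothetical keys:
    'sclerosed' (sclerosed glomeruli among the 100 counted on that slide)
    and 'TA', 'IF', 'AS' (semiquantitative grades for that slide).
    """
    if not slides:
        raise ValueError("at least one slide is required")
    # GS: mean of the per-slide counts; with 100 glomeruli counted per slide
    # each count is numerically equal to a percentage.
    gs_percent = mean(s["sclerosed"] for s in slides)
    # Other variables: the worst (highest) grade seen on any slide.
    worst = {key: max(s[key] for s in slides) for key in ("TA", "IF", "AS")}
    return {"GS_percent": gs_percent, **worst}


# Toy readings for three slides of one kidney (not study data).
example = [
    {"sclerosed": 10, "TA": 1, "IF": 1, "AS": 2},
    {"sclerosed": 14, "TA": 0, "IF": 2, "AS": 1},
    {"sclerosed": 12, "TA": 1, "IF": 1, "AS": 1},
]
print(score_discarded_kidney(example))
```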

Comparisons

First, we compared the reproducibility of general pathologists (original on-call report plus three general) versus specialist pathologists for Remuzzi grades. Next, the Banff parameters of the 3 general and 2 specialist pathologists were compared. Lastly, we compared the Remuzzi grades on biopsies, both as reported by the on-call pathologist and as reviewed by a single specialist, with the grades assessed on organs by the single specialist.

Statistics

Intraclass correlation coefficient (ICC) was used to measure reproducibility between pathologists [23]. ICCs and associated 95% confidence intervals (CIs) were calculated for all parameters using a two-way random-effects, agreement-type model, separately for the general group (three pathologists plus the original reports’ results, assumed to be representative of all general pathologists) and for the two specialist pathologists. An ICC of > 0.5 was considered to represent adequate agreement, with agreement graded as excellent (> 0.75), good (0.5–0.75), fair (0.25–0.5) or poor (< 0.25). Comparisons between the score on biopsy, as reported by the on-call pathologist and by one specialist pathologist, and the score on the organ were made separately. Continuous/ordinal biopsy findings were compared using the Wilcoxon signed rank test for paired data. For each comparison, only pairs with an adequate biopsy and all available results were included (n = 45). The difference was considered significant at a level of p < 0.05. To further assess the agreement between the scores on biopsies and on organs, a weighted κ coefficient was used, both between the scores of the on-call pathologist and organs and between the scores of the specialist and organs. CIs for the weighted κ values were obtained by bootstrapping. The level of agreement for κ coefficients is generally accepted as follows: 0.0–0.2, slight; 0.21–0.4, fair; 0.41–0.6, moderate; 0.61–0.8, substantial; and 0.81–1.0, almost perfect. All the analyses were carried out with the open-source software R 3.5.2 (R Foundation for Statistical Computing, Vienna, Austria).
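For readers unfamiliar with the ICC, the following Python sketch shows one way a two-way random-effects, absolute-agreement coefficient can be computed from a complete table of scores. It is not the authors' R code. It assumes the single-measure form, ICC(2,1) in the Shrout and Fleiss notation, which the text does not specify; the function names are hypothetical, and the handling of values falling exactly on a band boundary is also an assumption.

```python
def icc_agreement(ratings):
    """Single-measure, two-way random-effects, absolute-agreement ICC.

    `ratings` is a list of rows, one per biopsy, each holding one score per
    rater with no missing values. Undefined (division by zero) when every
    score in the table is identical.
    """
    n = len(ratings)      # subjects (biopsies)
    k = len(ratings[0])   # raters (pathologists)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 rows and >= 2 raters")
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: subjects, raters, residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def icc_band(icc):
    """Agreement label using the cut-offs stated in the text."""
    if icc > 0.75:
        return "excellent"
    if icc >= 0.5:    # boundary handling is an assumption
        return "good"
    if icc >= 0.25:   # boundary handling is an assumption
        return "fair"
    return "poor"


# Toy table: 6 subjects scored by 4 raters (not study data).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
value = icc_agreement(toy)
print(round(value, 2), icc_band(value))  # prints: 0.29 fair
```

The toy table shows why the agreement-type model matters here: raters who rank subjects similarly but score systematically higher or lower than one another are penalized, which is the relevant behaviour when a fixed score threshold decides allocation.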

Results

Donor data

There were 21 (58.3%) male and 15 (41.7%) female donors, aged 44–87 years (mean 71.8 ± 10.6). Donor SCr was reported as within normal limits with no further specification in 12 (33.3%) cases, ranged between 0.5 and 1.1 mg/dL (mean 0.92 ± 0.26) in 16 (44.4%) cases, and was not recorded, with only poor function reported, in 8 (22.2%) cases. Cause of death was available for 16 donors (44.4%): of these, 14 (87.5%) died from intracranial hemorrhage and 2 (12.5%) died of cardio-respiratory failure. The kidney was discarded because of damage to the kidney in 16 (34.8%) cases, Remuzzi score in 7 (15.2%) cases, other pathology of the organ in 9 (19.6%) cases, refusal by the recipient transplant center in 4 (8.7%) cases, and other unrecorded reasons in 10 (21.7%) cases.

Comparison between general and specialist pathologists

For the variables used to determine biopsy adequacy (number of glomeruli and arteries) specialist pathologists obtained excellent agreement whilst general pathologists obtained good agreement. For the variables of Remuzzi score, the agreement was good for GS and fair for TA by both groups, whilst for the other two variables the specialists achieved greater agreement than the generalists: IF agreement fair vs. poor; AS good vs. fair, respectively. Despite the poorer agreement of the general pathologists on some variables, both specialist and general pathologists achieved an agreement in the range of good on the total Remuzzi score (the addition of the 4 variables), with a higher value for the specialists. The additional features from the Banff consensus document showed far better agreement by specialists than general pathologists with all variables having poor agreement by the general pathologists and at least good by the specialists. The ICCs for all parameters and groups are shown in Table 3.

Table 3 Intraclass correlation coefficient (ICC) values for both groups of pathologists

Comparison of scores between biopsies and discarded organs

The Remuzzi score was significantly higher in biopsies compared to the discarded kidney (mean 4.24) when assessed by both the on-call (mean 5.49; p < 0.001) and the specialist pathologist (mean 4.89; p < 0.001). The scores for all Remuzzi variables were significantly higher in the biopsy when scored by the on-call pathologist, whilst only IF and AS were significantly higher when scored by the specialist. The GS was, however, significantly greater in biopsies than in organs when assessed as a percentage (data only for the specialist, p = 0.029). None of the organs had a score ≥ 7, 33 (71.7%) had a score ≤ 4 and 13 (28.3%) had a score of 5–6. Comparisons of scores in biopsies and organs are summarized in Table 4, while the distribution of scores is represented in Figs. 1 and 2.

Table 4 Comparison between scores on biopsy and organ and κ coefficients for general and specialist pathologist

Fig. 1 Distribution of scores for the single Remuzzi grades according to the on-call general pathologist, specialist pathologist, and discarded organ

Fig. 2 Distribution of total Remuzzi scores according to the on-call general pathologist, specialist pathologist, and discarded organ

Agreement between biopsies and discarded organs

Weighted κ values for all scores are summarized in Table 4. There was moderate agreement between the total Remuzzi score in the biopsy and in the organ for the specialist (κ 0.5443), whilst the on-call pathologist achieved only fair agreement (κ 0.3956). The agreement achieved by the specialist for the individual variables varied from substantial for TA, to fair for GS and IF, and slight for AS, while the on-call pathologist achieved moderate agreement for AS and fair agreement for the other variables.
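As an illustration of how a weighted κ with a bootstrap CI of this kind can be obtained, the following Python sketch may help. It is not the authors' analysis, which was run in R. The weighting scheme is not stated in the text, so linear weights are assumed here; the percentile bootstrap, the number of resamples, the function names and the toy scores are likewise assumptions.

```python
import random


def weighted_kappa(a, b, n_categories):
    """Linearly weighted Cohen's kappa for two paired lists of integer
    scores in the range 0 .. n_categories - 1. Returns NaN when chance
    disagreement is zero (both lists constant and equal)."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    # Marginal proportions of each category in the two lists.
    pa = [a.count(c) / n for c in range(n_categories)]
    pb = [b.count(c) / n for c in range(n_categories)]
    # Disagreement weight |i - j|; its scaling factor cancels in the ratio.
    observed = sum(abs(x - y) for x, y in zip(a, b)) / n
    expected = sum(
        pa[i] * pb[j] * abs(i - j)
        for i in range(n_categories)
        for j in range(n_categories)
    )
    if expected == 0:
        return float("nan")
    return 1 - observed / expected


def bootstrap_ci(a, b, n_categories, n_boot=2000, seed=0):
    """Approximate percentile 95% CI, resampling (biopsy, organ) pairs."""
    rng = random.Random(seed)
    pairs = list(zip(a, b))
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        k = weighted_kappa(
            [p[0] for p in sample], [p[1] for p in sample], n_categories
        )
        if k == k:  # skip undefined (NaN) resamples
            stats.append(k)
    stats.sort()
    return stats[int(0.025 * len(stats))], stats[int(0.975 * len(stats)) - 1]


def kappa_band(kappa):
    """Agreement label using the bands stated in the Statistics section."""
    if kappa < 0:
        return "below chance"  # not covered by the bands in the text
    for upper, label in ((0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy total scores (0-12, so 13 categories); not study data.
biopsy = [5, 4, 6, 3, 5, 7, 4, 2]
organ = [4, 4, 5, 3, 4, 5, 3, 2]
print(weighted_kappa(biopsy, organ, 13), bootstrap_ci(biopsy, organ, 13))
print(kappa_band(0.5443), kappa_band(0.3956))  # prints: moderate fair
```

Resampling whole pairs, rather than biopsy and organ scores independently, preserves the pairing that the agreement statistic measures.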

A representative image of biopsy overcalling GS and IF with corresponding discarded organ is shown in Fig. 3.

Fig. 3 Example of sampling error on biopsies. In a and b, a biopsy with sclerosed glomeruli in the subcapsular region leading to overscoring; in c and d the corresponding organ with the majority of glomeruli not sclerosed throughout the full thickness of parenchyma. Original magnification: a 10×; b 20×; c 5×; d 10×

Discussion

Our study has highlighted several important issues related to the reliability of biopsy sampling in representing the overall state of the kidney and to the role of expertise in renal pathology in assuring reproducibility.

This study has found that pre-implantation biopsies have significant sampling error and are not representative of the kidney as a whole, overscoring the amount of chronic damage. Total and single parameter scores were lower in organs compared to biopsies, and this remained true whether the biopsy was scored by a general or a specialist pathologist. The overscoring is more noticeable with general than with specialist pathologists, suggesting a risk that more organs which would have been suitable for a single transplant are turned down or used as a double, reducing the number of patients who receive a transplant. Indeed, had the management of our discarded organs been based only on the biopsy score of the on-call general pathologist, a proportion of organs would have been erroneously discarded (7 organs, 15.2%). Of the 26 cases in which one of the organs was transplanted and the other discarded, clinical information was available in 21. Eleven of these 21 showed good renal function after transplant, while in 10 cases there was residual mild post-transplant renal impairment. Among the 11 cases, there was one case with score 2, one with score 3, three with score 4, five with score 5 and one with score 6. In these latter cases, due to good renal function and optimal macroscopic appearance of the organ, the surgeons decided to perform a single transplant. Their contra-lateral kidneys were discarded for anatomical reasons and the scores on the whole specimen were lower than on the biopsies, suggesting that the amount of chronic damage was overscored and that allocation as a single transplant could, speculatively, be considered correct. Among the 10 cases with residual mild post-transplant renal impairment, there were two cases with score 6, four with score 5 and four with score 4. In particular, in the cases with score 4 the contra-lateral discarded kidney showed a value of 2 for arteriosclerosis/vascular narrowing in the organ, suggesting the importance of this feature within the score.

General pathologists, who represent the majority of on-call pathologists assessing the suitability of a potential donor kidney, overscored all parameters, whilst only IF was significantly higher in biopsies than in discarded organs when scored by the specialist. The percent GS was, however, significantly higher in the biopsies than in discarded organs, demonstrating a “true” sampling error of GS, as the majority of biopsies were wedge biopsies. This finding is consistent with the evidence of Muruve et al. [6], who showed that wedge biopsies overcall GS due to subcapsular overrepresentation with ischaemic-type GS. In younger live donors, wedge biopsies do not appear to overcall GS [5], presumably because this has not yet developed. In these younger donors, vascular changes are underscored in wedge biopsies [5], presumably due to the lesser sampling of the larger arteries where hypertensive-type chronic changes are seen.

Suboptimal scoring by general pathologists has been previously reported using the sum of Banff cut-offs (with modified GS cut-offs); in that study the aggregate scores of general pathologists, unlike those of specialist pathologists, did not correlate with kidney outcome [10]. Agreement between the scores of general and specialist pathologists was poor or fair for all variables other than GS, for which it was near perfect [10]. General pathologists tended to have higher aggregate scores and thus a tendency to overcall chronic damage, similar to our findings. Whilst our study has not directly compared the general pathologists’ scores with those of the specialist pathologists, it supports these findings, based on the mean scores of each group and the weaker correlation of the general pathologists’ biopsy scores with the organ scores assigned by a specialist pathologist.

A large study [8] comparing the original “urgent” report scores of needle and wedge biopsies with the discarded kidneys found that both types overscored GS and underscored vascular changes. As in our study, a concordance κ index was used, but direct comparison is difficult for a number of reasons. The type of pathologist is not specified, although it is likely to be general for the biopsies, which were based on the urgent report, and specialist for the discarded kidney. In addition, needle biopsies predominated (77%), with 22% having 10 or fewer glomeruli and only 20% having at least 25 glomeruli. The overcalling of GS on biopsy in that study may relate both to sampling (subcapsular accentuation in wedge biopsies and the small number of sampled glomeruli in needle biopsies) and to the scoring being done by general pathologists. The undercalling of vascular narrowing is likely due to sampling error, although the number of arteries is not provided. A survey of works dealing with agreement between general and specialist pathologists and between biopsies and whole organ is found in Table 5.

Table 5 Summary of studies dealing with comparison between biopsies and organs and/or agreement of pathologists

It has been recognized by many transplant centers that the Remuzzi cut-offs result in kidneys being discarded or used as a double when they would have functioned at an adequate level as a single transplant [24,25,26]. Based on these findings, it is possible that kidneys with higher grades of chronic damage may be suitable for double transplants, thus further expanding the utilization of kidneys from ECDs.

Agreement based on ICC was consistently higher in the specialist group. The agreement of general pathologists was good or excellent for only a minority of parameters, being just fair or poor for most. Of the variables used for Remuzzi scoring, the best agreement was reached on GS, while the other parameters achieved only fair or poor agreement. Our results are comparable with those obtained by Liapis et al. [22], an important study comparing scoring with the ICC measure, in which all the scoring pathologists were specialists. Overall ICC values in Liapis et al. were lower than our specialist values; furthermore, their biopsy population was more variable, with similar proportions of wedge vs core biopsies and paraffin vs FS, while our biopsy set was mainly composed of wedge paraffin biopsies, the biopsy type which also showed the highest ICC values in Liapis et al. [22]. A previous work by Snoeijs et al. assessed reproducibility among three pathologists using ICC in a population of 44 pre-implantation FFPE biopsies. Reported ICC values were comparable with ours, even with stratification according to biopsy type, but with no consideration of the pathologists’ expertise [9].

A similar pattern was seen with the additional Banff variables, which were assessed with less agreement by general pathologists. The value of assessing these variables in determining suitability of an organ is questionable. Extensive glomerular thrombi in donors can resolve, acute tubular necrosis (ATN) in the donor does not impact graft survival, arteriolar hyalinosis has only some association with decreased eGFR and interstitial inflammation appears to have no impact on graft function [27, 28]. One reason for general pathologists having less agreement for these parameters is that they look at a renal biopsy only very occasionally as part of the general on-call rota, whilst a renal pathologist looks for these features regularly. Another factor is that the scores of the on-call pathologists based on FS were included. FS are undertaken to shorten the time to delivering a Remuzzi score to the surgeon, thus decreasing ischaemia time and minimizing early graft dysfunction. Tissue quality of FS is poorer, with more artifact making interpretation more difficult; it is therefore not surprising that FS have been shown to have lower reproducibility [22]. The lack of perfect agreement among specialist pathologists reflects the inherent difficulty of turning a continuous variable into a grade [29], and this may be even more of a problem for general pathologists, who lack familiarity with the specimens. Indeed, as the only quantitative parameter was the glomerular count, a morphometry study could have provided a more reliable quantification of all other parameters for comparison with the pathologists’ estimates.

Improvement in reproducibility, particularly of parameters that are more difficult to grade by eye such as tubulo-interstitial scarring, can be obtained by the use of morphometry tools on digital whole slide images [30], particularly for the variables that even expert pathologists struggle to quantify on either a linear or semiquantitative scale, like TA and IF [29], as explored in kidney biopsy in other settings [31]. The comparable ICCs between Liapis et al. [22], where only digital slides were assessed, and our study using conventional microscopy, together with our previous work, support digital pathology as a suitable way to assess these specimens [32]. Indeed, in a large center with general pathologists involved in on-call rotations, it could be advantageous to have access to second-opinion consultations and/or to develop a remote network of specialist pathologists.

In summary, pre-implantation renal biopsies overcall the Remuzzi score when compared with the score of the whole organ, which can result in suboptimal utilization of donor kidneys. This is particularly the case when general pathologists report the biopsies, in which case all variables are overscored. The overrepresentation of GS in wedge biopsies remains an issue even when they are evaluated by specialist pathologists. To optimize utilization of ECD kidneys, consideration should be given to standardizing the type of biopsy to the one with the least sampling error, to developing a specialist on-call pathology service, and to the subsequent use of digital algorithms to increase reproducibility. Further studies are required to determine more appropriate cut-offs between single, double and discard categories based on outcome data.