Introduction

The management of kidney stones is often simple and straightforward, but more complex cases can present a number of challenges. Stone size, location, and density, along with variable patient and renal anatomy, all play a role in deciding surgical approach and expected outcomes. The ability to preoperatively assess the cumulative complexity of a patient and their kidney stone, and to estimate stone-free rates, operative times, blood loss, and complications, is imperative. This information would allow a urologist to counsel their patient toward the ideal surgical modality, apprise them of the potential need for multiple surgeries, and establish realistic expectations for outcomes.

Perhaps the most important information regarding the complexity of a patient’s kidney stone (and often multiple stones) can be derived from cross-sectional imaging. A non-contrast computed tomography (CT) scan provides detailed data regarding patient anatomy and a multi-dimensional assessment of the stone’s size, location, and density. Several groups have attempted to objectively quantify the relative complexity of a patient’s stone burden and relate it to various outcomes, most notably stone-free rates, followed by estimated blood loss (EBL), complications, length of hospital stay (LOS), and quality of life following the stone procedure.

The term nephrolithometry has been used to describe the objective classification of stone burden and surgical complexity. The ideal nephrolithometry risk assessment tool would be easy to use, derived from information readily available as part of a patient’s preoperative workup, and reproducible across practitioners. The primary goal is to derive an outcomes assessment for the individual patient. Secondarily, a widely adopted nephrolithometry tool would allow for the standardization of stone burden complexity when studying patient outcomes across modalities and institutions. Current trends in medicine emphasize quality outcome measures with potential links to reimbursement. An objective scoring system of patient-stone complexity may be used, in due course, to establish an expected outcome for each individual and at that point would become very relevant to all practicing urologists.

In this review, we will discuss various nephrolithometry tools which utilize preoperative imaging to assess stone complexity, their benefits and shortcomings, and their potential applications in clinical and academic settings.

The Guy’s Score

The Guy’s score was published in 2011 by Thomas and colleagues using risk factors of stone complexity derived from the available literature along with their internal expert opinions [1••]. They surmised that a greater number of stones, staghorn stones, stones in the upper calyces, and abnormal patient anatomy would increase the overall complexity of percutaneous nephrolithotomy (PCNL). The Guy’s score combines these variables and grades complexity from I to IV, grade I being the least and grade IV the most complex stones (Table 1). Grade I includes a solitary stone in the mid/lower pole or in the renal pelvis with simple anatomy; grade II includes a solitary stone in the upper pole or multiple stones in a patient with simple anatomy, or a solitary stone in a patient with abnormal anatomy; grade III includes multiple stones with existing abnormal anatomy, stones in a calyceal diverticulum, or a partial staghorn calculus; and grade IV includes any staghorn calculus or any stone in a patient with spina bifida or spinal injury.

Table 1 Overview of Guy’s score

Patients with a higher grade were less likely to be stone free (I—81 %, II—72.4 %, III—35 %, IV—29 %) and required more complex ancillary procedures (with ESWL, ureteroscopy, and PCNL being secondary procedures in increasing complexity) for management of clinically significant stone fragments (>4 mm). The group internally validated their grading system and also found it to be reproducible when a subset of patients was scored by three physicians blinded to outcomes (kappa 0.81).
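The grading rules above can be sketched as a short decision function. This is a minimal illustration rather than the authors' implementation: the `StoneCase` field names are hypothetical, and the sketch assumes that a partial staghorn maps to grade III while only a complete staghorn maps to grade IV, which is one reading of the overlapping wording. The stone-free percentages are those reported for the original cohort.

```python
from dataclasses import dataclass

# Stone-free rates (%) by grade, as reported for the original cohort.
REPORTED_STONE_FREE_PCT = {1: 81.0, 2: 72.4, 3: 35.0, 4: 29.0}


@dataclass
class StoneCase:
    # All field names are hypothetical, chosen only for this sketch.
    stone_count: int
    upper_pole: bool = False            # a solitary stone sits in the upper pole
    abnormal_anatomy: bool = False
    calyceal_diverticulum: bool = False
    partial_staghorn: bool = False
    complete_staghorn: bool = False     # assumption: only complete staghorn is grade IV
    spina_bifida_or_spinal_injury: bool = False


def guys_grade(case: StoneCase) -> int:
    """Return the Guy's grade (1-4), checking the most complex criteria first."""
    if case.complete_staghorn or case.spina_bifida_or_spinal_injury:
        return 4
    multiple = case.stone_count > 1
    if (case.partial_staghorn or case.calyceal_diverticulum
            or (multiple and case.abnormal_anatomy)):
        return 3
    if multiple or case.upper_pole or case.abnormal_anatomy:
        return 2
    return 1


if __name__ == "__main__":
    case = StoneCase(stone_count=3, abnormal_anatomy=True)
    grade = guys_grade(case)
    print(grade, REPORTED_STONE_FREE_PCT[grade])
```

Checking the highest grade first keeps each rule simple, since a case that satisfies criteria for several grades is assigned the most complex one.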

One notable limitation of the Guy’s score was that it was studied using variable imaging modalities. The authors state that CT, abdominal plain films (kidney, ureter, and bladder (KUB)), or intravenous pyelograms (IVP) were used when available. In the original study, the Guy’s score did not significantly correlate with other perioperative factors such as blood loss or complications. In an external validation study by Mandal, a higher Guy’s score was predictive of lower stone-free rates following a single-staged PCNL and of a higher complication rate for grade III and IV patients [2]. Ingimarsson also validated the Guy’s score as reproducible (kappa 0.72) and showed that it can predict stone-free rates across increasingly stringent definitions of a relevant residual fragment: <4 mm, <2 mm, or no fragments [3]. Whether the Guy’s score is predictive of secondary outcomes remains to be elucidated; however, as a model purely intended to estimate stone-free rates, the Guy’s score is accurate, simple, and easy to use.

The S.T.O.N.E. Score

Inspired by the R.E.N.A.L. nephrometry score [4] for grading kidney tumor complexity for partial nephrectomy, Okhunov and colleagues developed the S.T.O.N.E. score [5••] using readily available measurements from a preoperative CT which have been shown to be clinically relevant in determining outcomes following PCNL. The scoring system includes stone size, tract length (skin-to-stone distance), degree of obstruction, the number of calyces involved, and the essence (density) of the stone. The variables are graded by severity and assigned a value, and the sum of these values provides the final score, ranging from 5 to 13 (Table 2). The authors found that in their initial patient cohort, the S.T.O.N.E. score was an accurate predictor of stone-free status following PCNL (accuracy of 83.1 %). Additionally, the S.T.O.N.E. score correlated with operative time, EBL, and LOS.

Table 2 Overview of S.T.O.N.E. nephrolithometry scoring system
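The summation step can be illustrated with a small sketch. It takes the already-graded points for each of the five components and returns their sum. The per-component maxima in `MAX_POINTS` are an assumption recalled from the original S.T.O.N.E. publication and are not stated in this review; they are chosen to be consistent with the 5 to 13 range quoted above. The measurement cutoffs that map a CT finding to a point value are deliberately left out.

```python
# Maximum points per component. ASSUMPTION: these maxima are not given in the
# review text; they are consistent with the quoted total range of 5 to 13,
# with every component contributing at least 1 point.
MAX_POINTS = {
    "size": 4,         # S: stone size
    "tract": 2,        # T: tract length (skin-to-stone distance)
    "obstruction": 2,  # O: degree of obstruction
    "calyces": 3,      # N: number of calyces involved
    "essence": 2,      # E: stone density
}


def stone_score(points: dict) -> int:
    """Sum graded component points after checking that each is in range."""
    if set(points) != set(MAX_POINTS):
        raise ValueError("expected exactly these components: %s" % sorted(MAX_POINTS))
    for name, value in points.items():
        if not 1 <= value <= MAX_POINTS[name]:
            raise ValueError("%s must be between 1 and %d" % (name, MAX_POINTS[name]))
    return sum(points.values())


if __name__ == "__main__":
    example = {"size": 3, "tract": 1, "obstruction": 2, "calyces": 2, "essence": 1}
    print(stone_score(example))
```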

In a subsequent publication, Okhunov [6] showed that the total S.T.O.N.E. score (kappa 0.87) and its constituent components were reproducible across users. However, reproducibility was dependent on expertise, as attending urologists and fellows fared slightly better than residents and significantly better than medical students. In an external cohort, Akhavein found that patients with a residual stone fragment of 0–4 mm versus >4 mm had significantly different S.T.O.N.E. scores (8.9 and 10.3, respectively) [7]. The same group evaluated the scoring system for prediction of the need for secondary procedures after primary PCNL. They found that patients requiring a secondary procedure had significantly higher scores than those who required only a single stage to achieve stone clearance (9.9 versus 8.6, p < 0.02).

The S.T.O.N.E. score is unique in that it takes into consideration the tract length which likely acts as a surrogate for body mass index (BMI), a factor which has been associated with complications in many surgical series, without needing the patient’s height or weight. The inclusion of tract length also recognizes the potential difficulty in gaining access to the calyx of interest and limited freedom to navigate the collecting system through a single access.

The S.T.O.N.E. score is simple to use and relies exclusively on information readily available from a preoperative CT scan, a nearly ubiquitous study for all patients now undergoing PCNL. Its ability to predict stone-free rates, EBL, and LOS is helpful in preoperative patient counseling, but it ultimately falls short in estimating complications.

The CROES Nomogram

Nomograms have the advantage of grading risk across a continuous scale rather than lumping patients into discrete groups. With the introduction of the Kattan prostate cancer nomograms, urologists have gained increasing comfort in their interpretation and clinical implementation. In 2013, Smith and other members of the CROES group compiled data from 2806 patients across multiple institutions to develop a nomogram to calculate the stone-free (<4 mm) probability following PCNL [8••]. The CROES nomogram considers stone size, location, and number of stones as well as prior surgery. Perhaps the most controversial variable is the case volume of the center at which the procedure is being performed. The CROES nomogram acknowledges that, in addition to patient and stone characteristics, experience and expertise play a role in rendering a patient stone free. Each variable is assigned a point value, allowing greater weight for more clinically significant data points, and the values are ultimately summed for a final score.

The calculation of the score can be cumbersome as it requires the nomogram in hand to do so; however, the ultimate result does provide an actual probability that the patient will be rendered stone free.

The CROES nomogram was internally validated using bootstrapping, and in a receiver-operating characteristic (ROC) analysis, it outperformed the Guy’s score (area under the curve (AUC) 0.76 versus 0.69, p < 0.001). The authors note that the nomogram performed well when predicting stone-free rates between 70 and 90 %, but was less accurate at lower stone-free rates.

A notable shortcoming of the CROES nomogram is that stone-free rates were determined using KUB and not a more sensitive CT scan. It is likely that the calculated score will overestimate stone-free rates if patients are followed postoperatively with CT scans. Perhaps the greatest asset of the CROES nomogram is that it was developed from a multi-provider, multi-institutional, and multi-national data set, increasing its likelihood of being reproducible in clinical and academic use.

Comparison of the Guy’s, S.T.O.N.E., and CROES Scores

The methodology and design of each scoring system are different, but all three were developed for the single purpose of predicting outcomes following PCNL. Each one has been internally as well as externally validated, and its strengths and limitations are clearly identified. Earlier this year, Labadie et al. published a head-to-head comparison of the Guy’s, S.T.O.N.E., and CROES scoring systems [9•]. The nephrolithometry scores for 246 patients from three academic institutions were calculated using each of the three tools, and the authors compared each system’s ability to predict outcomes such as stone-free (<2 mm) rates, 30-day complications, EBL, and LOS, among others.

The authors found that, within each system, there was a significant difference in scores between the patients who were deemed stone free on postoperative CT scan and those with residual fragments >2 mm. The mean Guy’s scores were 2.2 versus 2.7, the S.T.O.N.E. scores were 8.3 versus 9.5, and the CROES scores were 222 versus 187, comparing stone-free versus residual fragments, respectively. Ultimately, AUCs were not significantly different among the three systems on ROC analysis, illustrating that no tool was superior in discriminating which patients would be stone free following PCNL and which would not. With regard to perioperative outcomes, only the Guy’s and the S.T.O.N.E. scores were predictive of EBL and LOS. Perhaps most disappointing, no system correlated with complications.

Additionally, the authors found that no scoring system was superior to stone burden alone in predicting stone-free rates. While this is only a single study, Labadie’s findings do call into question the clinical utility of the above nephrolithometry scoring systems and their ability to predict certain outcomes. At the very least, these systems highlight specific attributes a surgeon should consider when assessing stone complexity, such as stone size, number and location, extent of involvement, and abnormal anatomy, and at most they provide an objective grading system to classify patients in an academic setting.

Given the almost equal predictive ability of each scoring system, their utilization will likely vary by surgeon’s preference. Alternatively, the S.T.O.N.E. score may perhaps be best utilized in an unremarkable patient with a complex stone, the Guy’s score used in patients with anatomic anomalies, and the CROES nomogram in studies comparing outcomes across various institutions and techniques.

Staghorn Morphometry

Mishra and colleagues brought to light the ambiguity of the term staghorn when assessing stone complexity [10••]. While modifiers such as complete and partial staghorn try to modulate implied complexity, there is no universal definition to convey objective information [11]. They propose that a more complete and clinically relevant assessment would be the “staghorn morphometry,” the volumetric distribution of the staghorn throughout the collecting system. Staghorn morphometry was determined using preoperative CT-urogram studies which were then analyzed with the 3D-DOCTOR™ software. Using the rendered 3D reconstruction, the authors were able to measure total stone volume (TSV) directly and subdivide the stone burden by pelvic, entry calyx, favorable calyx, and unfavorable calyx volumes. The entry calyx corresponded to the ideal calyx of access for PCNL. Favorable calyces were those with a wide (>8 mm) infundibulum and an obtuse angle of access from the entry calyx, whereas unfavorable calyces had a narrow infundibulum or an acute access angle [10••].

The authors’ main objective was to stratify staghorn stones into three subtypes: type 1, PCNL performed in a single stage through a single tract; type 2, PCNL performed through multiple stages or multiple tracts, but not both; and type 3, PCNL performed through multiple stages and multiple tracts. Their findings show that TSV predicted the number of stages, while the percentage of stone burden in unfavorable calyces predicted both the stages and the tracts required to clear all stone. Type 1 stones had a TSV <5000 mm³ with an unfavorable calyx stone volume <5 % of TSV, or a TSV >5000 mm³ with an unfavorable calyx stone burden of no more than 2 %. Type 3 stones had a TSV greater than 20,000 mm³ with an unfavorable calyx volume of more than 10 %. All others were considered type 2 stones.
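The volume thresholds for the three staghorn subtypes translate directly into a rule-based classifier. The following is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, and a TSV of exactly 5000 mm³, which the thresholds as quoted leave undefined, is grouped here with the larger-volume bracket.

```python
def staghorn_type(tsv_mm3: float, unfavorable_pct: float) -> int:
    """Classify a staghorn stone as type 1, 2, or 3.

    tsv_mm3:         total stone volume in cubic millimetres
    unfavorable_pct: share of TSV lying in unfavorable calyces, in percent
    """
    if tsv_mm3 <= 0 or not 0 <= unfavorable_pct <= 100:
        raise ValueError("volume must be positive and percentage within 0-100")

    # Type 1: small stone with a limited unfavorable-calyx burden ...
    if tsv_mm3 < 5000 and unfavorable_pct < 5:
        return 1
    # ... or a larger stone with almost no unfavorable-calyx burden.
    # ASSUMPTION: exactly 5000 mm^3 is treated as part of this bracket.
    if tsv_mm3 >= 5000 and unfavorable_pct <= 2:
        return 1
    # Type 3: very large stone with a substantial unfavorable-calyx burden.
    if tsv_mm3 > 20000 and unfavorable_pct > 10:
        return 3
    # Everything else falls into type 2.
    return 2


if __name__ == "__main__":
    for tsv, pct in [(3000, 3), (8000, 1.5), (8000, 6), (25000, 15)]:
        print(tsv, pct, staghorn_type(tsv, pct))
```

Because type 1 requires at most 2 % and type 3 more than 10 % unfavorable-calyx volume in the large-stone range, the two rules cannot both match, so the order of the checks does not affect the result.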

Mishra’s group makes great use of cross-sectional imaging to best delineate stone complexity with regard to PCNL. However, this information is heavily reliant on third-party software and an expert user to determine stone volumes and calyceal favorability. Both of these factors limit the accessibility and practicality of the widespread integration of stone morphometry as described above into regular clinical practice.

Seoul National University Renal Stone Complexity (S-ReSC) Scoring System

Recently, Jeong and colleagues published the S-ReSC scoring system to predict stone-free rates following single-tract PCNL [12••]. The S-ReSC score is derived by counting how many of the nine potential intra-renal compartments are occupied by stone on cross-sectional imaging, giving a possible score of 1 to 9. These compartments include the renal pelvis (1), the upper and lower major calyces (2, 3), and the anterior and posterior upper minor (4, 5), interpolar (6, 7), and lower minor (8, 9) calyces (Fig. 1).

Fig. 1 Compartments include the renal pelvis (1), the upper and lower major calyces (2, 3), and the anterior and posterior upper minor (4, 5), interpolar (6, 7), and lower minor (8, 9) calyces
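Because the S-ReSC score is a simple count of involved compartments, it can be expressed in a few lines. The sketch below uses the compartment numbering of Fig. 1 and the three tiers (1–2, 3–4, 5–9) used when stone-free rates are reported in this section. The function names and the representation of involved sites as a collection of integers are assumptions made for illustration.

```python
from typing import Iterable

# Compartment numbering as in Fig. 1.
COMPARTMENTS = {
    1: "renal pelvis",
    2: "upper major calyx",
    3: "lower major calyx",
    4: "anterior upper minor calyx",
    5: "posterior upper minor calyx",
    6: "anterior interpolar minor calyx",
    7: "posterior interpolar minor calyx",
    8: "anterior lower minor calyx",
    9: "posterior lower minor calyx",
}


def s_resc_score(involved: Iterable[int]) -> int:
    """Count the distinct compartments (1-9) that contain stone."""
    sites = set(involved)
    unknown = sites - set(COMPARTMENTS)
    if not sites or unknown:
        raise ValueError("need at least one site, all numbered 1-9")
    return len(sites)


def s_resc_tier(score: int) -> str:
    """Map a score to the low (1-2), medium (3-4), or high (5-9) tier."""
    if not 1 <= score <= 9:
        raise ValueError("score must be between 1 and 9")
    if score <= 2:
        return "low"
    if score <= 4:
        return "medium"
    return "high"


if __name__ == "__main__":
    score = s_resc_score([1, 3, 8])   # pelvis, lower major, anterior lower minor
    print(score, s_resc_tier(score))
```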

The authors found their system to have excellent intra- (kappa 0.98) and interobserver (kappa 0.83) reproducibility. As expected, lower stone-free rates were achieved in patients with rising S-ReSC scores (96, 69, and 28.9 % stone free for scores 1–2, 3–4, and 5–9, respectively). Overall, the S-ReSC score performed well on ROC analysis (AUC 0.86). In an external validation study, Choo et al. confirmed that patients were less likely to be stone free with higher S-ReSC scores (83.9, 47.6, and 21.4 % stone free for scores 1–2, 3–4, and 5–9, respectively); however, the tool did not perform as strongly on ROC analysis (AUC 0.73) [13].

On secondary outcomes, a higher S-ReSC score correlated with longer operative times and greater EBL. While a trend toward higher complication rates was noted, this was not statistically significant.

The S-ReSC score is easy to calculate and can be used as a continuous variable (1 to 9) or subdivided into categories (low 1–2, medium 3–4, and high complexity 5–9), both of which are predictive of stone-free status following single-tract PCNL. The authors demonstrated that their single variable, the count of compartments involved, was a significant, independent predictor of stone-free status after controlling for stone number, largest stone diameter, total stone volume, average Hounsfield units, and hydronephrosis. Although they do not show whether adding these variables to their model improves performance, the single-variable score retains its simplicity and achieves an AUC comparable to that of more complicated models. Another advantage of the S-ReSC score is that the authors developed a modified version for ureteroscopy (retrograde intra-renal surgery, RIRS), which is discussed below.

There are several shortcomings of the Jeong study introducing the S-ReSC score that might limit its broader clinical use. A proportion of patients included in the study had their initial access obtained by a radiologist. Initial access during PCNL plays into the overall complexity of the procedure and should be considered. Additionally, urologist-guided access has been shown to be superior to that obtained by a radiologist with regard to stone-free rates and access-related complications [14, 15]. Also, by virtue of their primary objective, only patients in whom a single tract was utilized were included, and so potentially more complicated patients were excluded. Alternatively, the authors may have limited the prospects of achieving stone-free status through additional accesses as needed. Lastly, the mean BMI of all patients included (25) was slightly lower than that of the CROES (26) and considerably lower than that of the S.T.O.N.E. (30.6) study patients [5••, 8••]. It is unclear whether this variation would alter outcomes, but since the S-ReSC score does not take any patient-specific factors into account, it bears some consideration.

Ultimately, validation of the S-ReSC score with a more diverse patient population may allow for its broader application in a Western clinical environment where higher BMI values are more common. Additionally, the current iteration of the S-ReSC score predicts the ability to achieve a stone-free status through only a single tract. While this information is of value, it should not preclude surgeons from performing PCNL in patients who can be stone free by employing additional accesses as needed.

Stone Complexity in Patients Undergoing Retrograde Intra-Renal Surgery (RIRS, Ureteroscopy)

Much of the literature estimating stone complexity is dedicated to patients undergoing PCNL. PCNL lends itself to this analysis given that it is the modality of choice for patients with larger, more complex stones. Unlike the R.E.N.A.L. nephrometry score, which calculates kidney tumor complexity for surgeons deciding between a feasible partial versus more certain radical nephrectomy, there are no alternatives to PCNL for stones of the greatest size and complexity. In contrast, RIRS is generally reserved for a single or a few small to moderate-sized stones, or those that are radiolucent, precluding the use of shockwave lithotripsy. The ability to predict which patients are likely to fail RIRS and would instead benefit from PCNL would be of great clinical utility.

Resorlu-Unsal Stone (RUS) Score

In 2012, Resorlu and colleagues developed the Resorlu-Unsal Stone (RUS) score to help estimate “surgical complexity” and postoperative success following RIRS [16•]. They identified stone size, number of calyces involved, stone location, stone composition, and abnormal patient anatomy as factors which could preclude surgical success. They additionally recognized that access to the lower pole calyces can pose a challenge in RIRS, especially in patients with a narrow infundibulopelvic angle (IPA <45°). The authors performed a multivariate analysis including 207 patients to determine predictors of stone-free status (no fragments >1 mm) and found that each of the above parameters was an independent predictor of success except stone location. Stone composition, which is not available preoperatively, was also excluded from their final model.

The RUS score is dependent on factors which can be identified on preoperative imaging. After determining specific cutoffs, the final model included a single point for each of the following: composite stone length greater than 2 cm, multiple calyceal involvement, IPA <45°, and abnormal anatomy (horseshoe or pelvic kidney). Possible scores range from 0 to 4. The authors found that stone clearance rates of 97.1, 85.4, 70, and 27.2 % were achieved for patients with a score of 0, 1, 2, or ≥3, respectively.
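The one-point-per-factor structure of the RUS score lends itself to a short sketch. The function and parameter names below are hypothetical, and the lookup table simply restates the clearance rates reported for the original cohort; it is not a predictive model.

```python
# Stone clearance rates (%) reported for the original cohort.
# Key 3 stands for a score of 3 or more.
REPORTED_CLEARANCE_PCT = {0: 97.1, 1: 85.4, 2: 70.0, 3: 27.2}


def rus_score(stone_length_cm: float, multiple_calyces: bool,
              ipa_degrees: float, abnormal_anatomy: bool) -> int:
    """Return the RUS score (0-4): one point per adverse factor."""
    return (
        int(stone_length_cm > 2)      # composite stone length greater than 2 cm
        + int(multiple_calyces)       # more than one calyx involved
        + int(ipa_degrees < 45)       # narrow infundibulopelvic angle
        + int(abnormal_anatomy)       # horseshoe or pelvic kidney
    )


def reported_clearance_pct(score: int) -> float:
    """Look up the clearance rate reported for this score (3 and 4 share a bin)."""
    if not 0 <= score <= 4:
        raise ValueError("score must be between 0 and 4")
    return REPORTED_CLEARANCE_PCT[min(score, 3)]


if __name__ == "__main__":
    score = rus_score(2.5, multiple_calyces=False, ipa_degrees=40, abnormal_anatomy=False)
    print(score, reported_clearance_pct(score))
```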

Despite stone location not being a statistically significant independent predictor of clearance, it is hard to imagine that a 2-cm upper pole stone in a patient with a narrow IPA would not be more readily cleared with RIRS than the same stone located in a lower pole calyx of a similar patient. However, both patients would have 2 points using the RUS score. It is arguable that the latter patient would benefit from PCNL or require staged ureteroscopy for complete clearance.

Modified S-ReSC Score for Retrograde Intra-renal Surgery

Building on their existing S-ReSC score for PCNL [12••], Jung and colleagues acknowledge that lower pole stones present an increased challenge relative to a similar stone in an upper or interpolar calyx. The modified S-ReSC score for RIRS counts the same nine compartments (the renal pelvis, the major calyces, and the anterior and posterior minor calyces) to give a total value. The modification is that the three lower pole compartments (the lower major calyx and the anterior and posterior lower minor calyces) are given 2 points rather than 1, so that the total score range for the modified version is 1–12 [17•]. Jung illustrated that patients with a low score (1–2) had a high likelihood of being stone free (94.2 %) compared to those with medium (3–4, 84 %) or high (5–12, 45.5 %) scores. Granted, only three patients (4 %) had scores between 8 and 12 points. On ROC analysis, the predictive ability was high for the continuous (AUC 0.806) and three-tiered (low, medium, high; AUC 0.766) modified scores.
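The lower pole weighting can be sketched as a variation on a simple compartment count. This sketch assumes that each of the three lower pole compartments (numbered 3, 8, and 9 in Fig. 1) scores 2 points and every other compartment 1 point, which is the weighting consistent with the stated maximum of 12. The function names are hypothetical.

```python
from typing import Iterable

ALL_SITES = set(range(1, 10))   # compartment numbering as in Fig. 1
LOWER_POLE_SITES = {3, 8, 9}    # lower major calyx, anterior/posterior lower minor calyces


def modified_s_resc_score(involved: Iterable[int]) -> int:
    """Score for RIRS: 1 point per involved site, 2 for lower pole sites (assumed)."""
    sites = set(involved)
    if not sites or not sites <= ALL_SITES:
        raise ValueError("need at least one site, all numbered 1-9")
    return sum(2 if site in LOWER_POLE_SITES else 1 for site in sites)


def modified_s_resc_tier(score: int) -> str:
    """Map a score to the low (1-2), medium (3-4), or high (5-12) tier."""
    if not 1 <= score <= 12:
        raise ValueError("score must be between 1 and 12")
    if score <= 2:
        return "low"
    if score <= 4:
        return "medium"
    return "high"


if __name__ == "__main__":
    score = modified_s_resc_score([1, 8])   # renal pelvis plus one lower minor calyx
    print(score, modified_s_resc_tier(score))
```

Under this weighting a single lower pole stone already scores 2, the top of the low tier, which reflects the added difficulty of lower pole access in RIRS.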

The group went on to apply the RUS score to the same cohort of patients, allowing for an external validation. The highest score achieved using the RUS score in this cohort was 3 out of a possible 4. There were no patients who had an anatomic abnormality, which raises the question of the practical utility of this variable. While the RUS score showed decreasing stone-free rates with increasing score (93.6 % for 0, 78.1 % for 1, and 57 % for 2 points), both patients with a score of 3 were rendered stone free (100 %) despite being predicted to have residual stones. The predictive ability of the RUS score on ROC analysis was moderate (AUC 0.69) and significantly lower than that of the continuous (p = 0.012) or tiered (p = 0.04) modified S-ReSC score for RIRS. However, the superiority of the S-ReSC score would be more convincing if both the RUS and modified S-ReSC scores were validated in an external cohort.

Perhaps the most interesting application of the original S-ReSC score for PCNL and the modified version for RIRS would be for patients with a moderately complex clinical picture in which RIRS and PCNL are both treatment options. A physician could then objectively compare expected outcomes using one modality versus the other and make an informed decision about what best serves the patient. The similarities between the two scoring systems would lend themselves to easy calculation and outcome assessment with the same available clinical information.

Conclusions

The complexity of kidney stone management is in large part due to the variable complexity of kidney stones themselves. Each patient presents a unique challenge when considering the stone, the anatomy, and the techniques and skill sets available. Preoperative assessment of overall complexity is necessary for surgical planning and accurate patient counseling with regard to expected outcomes, especially stone-free status. There is perhaps no better imaging for kidney stones with regard to anatomic detail than a CT scan. However, which factors actually contribute to kidney stone complexity, and whether these affect outcomes, remains a point of discussion.

Several groups have developed clinical models to help estimate stone-free rates relying on a combination of stone size, location, distribution, and density along with patient factors such as abnormal anatomy, patient size, and renal obstruction. The clinical utilization of these models should be encouraged, especially when several of the aforementioned factors lead to a more complex case which would benefit from staging or perhaps a more invasive but at the same time definitive procedure.

The general clinical utilization of complexity scoring systems will likely stem from their use in the academic setting first and a better illustration of their applications. It is not clear which system is best, but as mentioned before, they may each have varying applications for different clinical situations until an overall consensus is reached. Nonetheless, as cross-sectional imaging and backend software advance, it is conceivable that a measure of complexity will be provided along with a standard report, similar to a Bosniak score [18] or other widely accepted radiographic classification system. Current trends in health care have placed a large emphasis on quality, with proposals to link incentives and possible reimbursement to quality-based measures and outcomes. However, critics of these policies are quick to point out that no two patients or cases are alike, and expected outcomes are highly dependent on overall complexity. Stone complexity scoring systems would be relied upon given their objective nature and reproducibility. In that instance, nephrolithometry could well emerge beyond an academic exercise and into mainstream urologic practice.