Introduction

Salvage therapies for the management of distant brain failure (DBF) following initial stereotactic radiosurgery (SRS) monotherapy can have a substantial impact on medical cost-effectiveness [1]. SRS is approximately four-fold more expensive than whole brain radiotherapy (WBRT) but has been shown to be more cost-effective in selected patient populations [2, 3]. While the use of upfront WBRT may reduce the likelihood of DBF [4], it is associated with toxicities such as fatigue [5] and cognitive decline [6]. Consensus guidelines and insurance reimbursement practices for the treatment of brain metastases have traditionally recommended the use of WBRT in patients with more than four brain metastases, as these patients have a higher rate of DBF [7]. However, using a strict numerical cutoff to dictate which patients can receive SRS may both over-treat and under-treat certain populations [8], and a growing body of literature suggests that the number of metastases alone is not indicative of outcomes [9].

In the decision analysis for upfront WBRT or upfront SRS, patients who rapidly develop multiple new brain metastases and require early WBRT may not be the best candidates for SRS. In this scenario, there are higher costs without a clinically significant postponement of WBRT-related toxicities. For this reason, we developed a model designed to predict both the risk and severity of DBF within the first several months after upfront SRS without WBRT. Prior attempts have been made to identify patients requiring early WBRT, but these models were generally limited by allowing physician discretion to influence the timing of WBRT [10] and the oversimplification of DBF as a single event rather than a heterogeneous phenomenon that can range from a single new metastasis to diffuse seeding of the brain [11, 12].

In the current study, we present a multi-institutional predictive nomogram for DBF that is both internally and externally validated. The nomogram was then validated again with a second independent and contemporary dataset. We also compare the predictive ability of the nomogram for DBF against the current guidelines that use a pure numerical cutoff. Finally, we perform an analysis in which we determine nomogram values that predict the number of metastases at time of DBF, and also define a nomogram value above which a patient is at high risk of being treated with early WBRT.

Materials and methods

Data acquisition

This study was approved by the Institutional Review Board at all participating institutions. Data were reviewed and collected at all eight institutions and initially encompassed patients treated from January 2000 through December 2013. Patients were included in the study if they underwent SRS for first diagnosis of brain metastases in the defined time period, and had one to ten brain metastases present at time of SRS. Electronic medical records were reviewed to determine clinicopathologic characteristics that were found to affect DBF in our prior analysis [11]. Burden of extracranial disease was defined based on prior published work [13]. Radiosurgical doses were generally determined based upon the guidelines published by Shaw et al. for single fraction radiosurgical treatment of brain metastases [14].

At the time of SRS planning and delivery, the majority (seven of eight) of institutions used thin-slice MRI with neuroradiology interpretation, and half of the institutions relied on double or triple contrast to help identify metastases at the time of treatment. The majority of patients underwent clinical and MRI follow-up 1–3 months after their SRS procedures and were subsequently followed every 3 months thereafter. DBFs and the number of brain metastases at DBF were determined based on serial imaging evidence of intracranial recurrence. A DBF was defined as any new metastasis that developed outside of the previous radiosurgical target volume.

Prognostic model derivation

Across the eight SRS centers, we collected data on 1484 patients treated between 2000 and 2013. Our primary outcome was time to DBF, measured from the time of initial SRS. We excluded patients missing any covariate used in our previous analysis [11], reducing the sample size to a total of 1354 patients. We then divided the sample into a training (N = 685, four centers) and validation (N = 665, four centers) dataset. We estimated a prognostic index (PI) for time to DBF using the linear predictor (\(X\hat{\beta}\)) from a multivariable Cox proportional hazards (PH) regression model (higher values for the PI indicate a worse rate of DBF). The following variables were included as predictors in the Cox model: age, gender, histology, minimum margin dose delivered to any metastasis during the index SRS session, burden of extracranial disease (none, oligometastatic, and widespread), status of extracranial disease (progressive or stable), and number of lesions. We evaluated the PH assumption for all predictors using hypothesis tests based on scaled Schoenfeld residuals [15].
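
As an illustration of how a prognostic index of this kind is formed, the following minimal sketch computes a linear predictor as the sum of log hazard ratios multiplied by covariate values. Only the hazard ratios for melanoma histology and margin dose are taken from the Results; the coefficient for number of lesions, the covariate names, and the two example patients are hypothetical, and the remaining covariates of the full model are omitted.

```python
import math

# Hazard ratios for melanoma histology and margin dose are the values reported
# in the Results; every other number here is a hypothetical placeholder.
LOG_HAZARD_RATIOS = {
    "melanoma": math.log(1.452),        # reported HR for melanoma histology
    "margin_dose": math.log(1.066),     # reported HR per one-unit dose increase
    "n_lesions": 0.10,                  # hypothetical coefficient
}


def prognostic_index(patient):
    """Linear predictor: sum of coefficient times covariate value."""
    return sum(beta * patient.get(name, 0.0)
               for name, beta in LOG_HAZARD_RATIOS.items())


def relative_hazard(patient_a, patient_b):
    """Hazard ratio of patient_a versus patient_b implied by their PIs."""
    return math.exp(prognostic_index(patient_a) - prognostic_index(patient_b))


# Two hypothetical patients who differ only in histology.
melanoma_patient = {"melanoma": 1, "margin_dose": 18, "n_lesions": 2}
other_patient = {"melanoma": 0, "margin_dose": 18, "n_lesions": 2}
print(round(relative_hazard(melanoma_patient, other_patient), 3))
```

Because the two patients differ only in histology, the implied hazard ratio reduces to the melanoma hazard ratio itself, which shows why a higher PI corresponds to a worse rate of DBF.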

To evaluate the predictive ability of the PI in the validation dataset, we evaluated the discrimination and calibration of the nomogram as described by Royston and Altman [16] for external validation within the context of censored time-to-event outcomes. We estimated several measures to evaluate model discrimination, including Harrell’s c-index [17], Gönen and Heller’s K statistic [18], \(r_{D}^{2}\) [19], and the explained relative risk measure of Heller [20]. For the c-index and K statistic, a value of 0.5 corresponds to random chance, while \(r_{D}^{2}\) and the explained relative risk measure are analogous to conventional \(R^{2}\) statistics from linear regression, ranging from 0 to 1.
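
To make the interpretation of the c-index concrete, the sketch below computes Harrell's concordance for right-censored data in its simplest pairwise form: among pairs in which one patient has an observed event before the other's follow-up time, it counts how often the patient who failed first had the higher PI. The data are hypothetical toy values, and production implementations handle tied event times more carefully than this sketch does.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of usable pairs in which the higher-risk patient fails first.

    times: follow-up time per patient
    events: 1 if DBF was observed, 0 if the patient was censored
    risk_scores: prognostic index per patient (higher means worse)
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when patient i has an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / usable


# Toy data (hypothetical): months to DBF or censoring, event flag, PI.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
scores = [2.5, 1.0, 1.8, 1.2, 0.4]
print(harrell_c_index(times, events, scores))  # 6 of 8 usable pairs -> 0.75
```

A value of 1.0 would mean that every usable pair is ordered correctly by the PI, and 0.5 corresponds to ordering by chance.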

Results are presented in terms of the cumulative distribution function \(F\left( t \right)=1 - S\left( t \right)\). To evaluate calibration, we mean-centered the PI in both datasets using the PI mean (2.045) estimated in the training dataset. We then divided each dataset into four prognostic groups based on the 16th, 50th, and 84th percentiles of the PI (mean-centered) from the training dataset. We estimated hazard ratios for each prognostic group in both datasets using Cox PH regression. Within the validation dataset, we then graphically compared the observed versus expected rate of DBF using three estimates of the survival function within each prognostic group: a Kaplan–Meier estimate versus two model-based estimates of covariate-adjusted survival. The model-based estimates differ only in whether the baseline cumulative hazard function (the non-parametric component of the Cox model) was estimated from the training dataset or from the validation dataset. Both model-based estimates compute the survival function conditional on the PI at all event times in the validation dataset \(\left( S\left( t \mid PI \right)=S_{0}\left( t \right)^{\exp \left( PI \right)} \right)\), and then average the survival function estimates at each time point within each prognostic group. While Royston and Altman used fractional polynomials to approximate the log of the cumulative baseline hazard function [16], we found that fractional polynomials tended to over-fit our data, leading to estimates of the survival function that were not strictly non-increasing (results not shown). Therefore, we used a simple linear approximation to the cumulative baseline hazard function in our calculations (results not shown). A second validation was then pursued with a new contemporary cohort of 193 cases treated from 2011 to 2014 at a ninth academic center. All analyses were performed using the R Statistical Computing Environment (R Core Team [21]).
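
The model-based estimate can be sketched as follows, using the standard Cox relationship in which the baseline survival function is raised to the power exp(PI), together with a baseline cumulative hazard that is linear in time. The slope and the mean-centered PI values are hypothetical placeholders rather than fitted quantities. Averaging the DBF probability F = 1 − S within a group is equivalent to averaging S and subtracting from one.

```python
import math


def dbf_probability(t_months, pi_centered, hazard_slope):
    """F(t | PI) = 1 - S0(t)^exp(PI), with H0(t) approximated as slope * t."""
    baseline_cum_hazard = hazard_slope * t_months        # linear approximation
    baseline_survival = math.exp(-baseline_cum_hazard)   # S0(t)
    return 1.0 - baseline_survival ** math.exp(pi_centered)


def group_mean_dbf(t_months, pis, hazard_slope):
    """Average the predicted DBF probability over the patients in one group."""
    return sum(dbf_probability(t_months, pi, hazard_slope) for pi in pis) / len(pis)


HAZARD_SLOPE = 0.07             # hypothetical slope per month, not a fitted value
low_risk = [-0.9, -0.7, -0.6]   # hypothetical mean-centered PIs
high_risk = [0.6, 0.8, 1.1]

for months in (3, 6, 9):
    print(months,
          round(group_mean_dbf(months, low_risk, HAZARD_SLOPE), 3),
          round(group_mean_dbf(months, high_risk, HAZARD_SLOPE), 3))
```

A linear cumulative hazard guarantees a survival curve that never increases, which is the property the fractional polynomial fit failed to preserve.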

Results

Prognostic model for distant brain failure

Patient characteristics for the training and both validation datasets are displayed in Table 1. There were several statistically significant differences in characteristics of the populations in the training and validation datasets. Kaplan–Meier estimates of DBF at 3 and 6 months were 19.4 and 36.2% in the training dataset, and 15.5 and 29.5% in the first validation dataset. For the new cohort (validation 2), Kaplan–Meier estimates of DBF at 3 and 6 months were 21.6 and 39.8%, respectively. Cox regression estimates were used in constructing the PI in the training dataset (Supplemental Data). An increasing number of metastases, melanoma histology (HR 1.452, p = 0.0187), and increasing minimum margin dose (HR per one unit increase: 1.066, p = 0.0033) were the major factors associated with an increased rate of DBF. There was no evidence of substantial deviation from the proportional hazards assumption of the Cox model (global test p value = 0.2968). The PI was used to generate the nomogram in Fig. 1.

Table 1 Characteristics of training and validation datasets

Fig. 1 Nomogram for risk of distant brain failure at 3, 6, and 9 months after radiosurgery. Oligomet oligometastatic (<5 total metastases), RCC renal cell carcinoma, L Squam lung squamous cell cancer, L Adeno lung adenocarcinoma, Her2+ her2 neu positive breast cancer, Her2 (−) her2 neu negative breast cancer

Internal validation of the PI

We used bootstrap resampling (1000 replicates) to evaluate the calibration of the model for the PI within the training dataset. We used the bootstrap datasets to generate confidence intervals for the probability of DBF at 3, 6, and 9 months for every possible combination of the covariates used in constructing the PI from the training dataset. Figure 2a–c shows calibration curves delineating the predicted and observed probabilities of DBF. These curves generally indicate concordance between the observed and model-based predictions.
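
The resampling idea can be illustrated with a much simpler quantity than the full Cox model. The sketch below builds a percentile bootstrap interval for a crude 3-month DBF proportion in a hypothetical cohort, ignoring censoring. It is a simplification for illustration only; the analysis described above instead refits the model and derives intervals for each covariate combination.

```python
import random


def bootstrap_percentile_ci(outcomes, n_replicates=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of binary outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    estimates = []
    for _ in range(n_replicates):
        # Draw n patients with replacement and recompute the proportion.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_replicates)]
    upper = estimates[int((1 - alpha / 2) * n_replicates) - 1]
    return lower, upper


# Hypothetical cohort: 1 = DBF by 3 months, 0 = no DBF (censoring ignored).
outcomes = [1] * 19 + [0] * 81
low, high = bootstrap_percentile_ci(outcomes)
print(low, high)
```

The width of the resulting interval reflects how much the estimate would be expected to vary across comparable cohorts of the same size.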

Fig. 2 Calibration curves for a 3, b 6, and c 9 months. DBF distant brain failure, Mo months. d Calibration of distant brain failure probabilities in the validation dataset. X-axis is in months. Y-axis is distant brain failure probability. Two-dash non-smooth lines are Kaplan–Meier estimates of the cumulative distribution function for the PI risk groups in the validation dataset. Solid smooth lines denote predicted, covariate-adjusted distant brain failure based on approximation to baseline cumulative hazard function estimated within validation dataset. Dotted smooth lines reflect predicted covariate-adjusted distant brain failure based on approximation to baseline cumulative hazard function estimated within training dataset

External validation of the PI

Table 2 presents estimates for measures of discrimination for the PI in both the training and validation datasets. In the validation dataset, we estimated Harrell’s c-index to be 0.631 (SE = 0.018), with Heller’s explained relative risk estimated to be 0.173 (SE = 0.048). These estimates imply a statistically significant ability of the PI to stratify patients according to their risk of DBF. This is further supported by the Harrell’s c-index of the contemporary validation set of 0.644 (SE = 0.030). When we divided patients in the validation dataset according to the percentiles of the PI from the training dataset, we similarly see an ability of the PI to stratify patients according to their risk of DBF.

Table 2 Discrimination measures and hazard ratios for prognostic groups based on percentiles of the prognostic index (PI) evaluated in the training and validation datasets

The Cox model fit to the validation dataset solely using the PI as a predictor yielded a coefficient for the PI of 0.985 (SE = 0.140, p = 0.9138) in validation 1 and 0.994 (SE = 0.195, p = 0.976) in validation 2. We then fit an additional Cox model to the validation datasets using all of the predictors included in deriving the PI, conditioned on the PI with its coefficient constrained to be one. A global test of all of the predictors (i.e. all β = 0) was not statistically significant in validation 1 (p = 0.602), but did indicate a small degree of lack of fit in validation 2 (p = 0.19). Figure 2d represents a graphical depiction of discrimination and calibration for the PI in the original and contemporary validation datasets. The Kaplan–Meier estimates illustrate the ability of the PI to appropriately risk-stratify patients, as the groupings defined by percentiles of the PI are correctly ordered in terms of their risk of DBF, with the exception of the two lowest risk groups being reversed in the second validation.
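
The p values attached to the PI coefficients are easiest to read as tests of whether the calibration slope differs from one, which is the value expected for a well-calibrated index. The sketch below assumes, without confirmation from the text, that they come from a two-sided Wald test under a normal approximation; the function name is illustrative.

```python
import math


def slope_wald_p_value(coefficient, standard_error, null_value=1.0):
    """Two-sided normal-approximation p value for H0: coefficient == null_value."""
    z = (coefficient - null_value) / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))


# Coefficients and standard errors as reported for the two validation datasets.
p_validation_1 = slope_wald_p_value(0.985, 0.140)
p_validation_2 = slope_wald_p_value(0.994, 0.195)
print(round(p_validation_1, 2), round(p_validation_2, 2))
```

Under that assumption the computed values agree with the reported p values to within rounding of the inputs, so large p values here indicate no detectable miscalibration of the slope rather than a lack of prognostic signal.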

Comparative performance of nomogram versus current clinical practice

In order to compare the predictive ability of the nomogram to current clinical practices, we considered a simplified prognostic model solely based on the number of lesions (>4 vs. ≤4) to mimic current clinical practice based on available randomized studies. For that model, Heller’s explained relative risk was 0.056 (SE = 0.033) in the derivation dataset, and 0.054 (SE = 0.034) in the validation 1 dataset and 0.080 (SE = 0.063) in validation 2, which are all inferior to the nomogram.

Ability of the nomogram to predict early WBRT and velocity of new brain metastases per year

To determine a threshold value that predicts early use of WBRT, the training dataset was used to search all nomogram values from the 10th to the 90th percentiles. Heller’s maximum explained relative risk occurred when dividing the training dataset at the 76th percentile (182 points on the nomogram). The multi-institutional validation dataset similarly showed an increased incidence of WBRT above the 76th percentile. The hazard ratio for requiring WBRT within 3 months of SRS was 3.04 (95% CI 2.18, 4.23) in the training dataset at this threshold, and 2.09 (95% CI 1.23, 3.56) in the validation dataset. Figure 3a, b shows use of WBRT within 3 months for patients above and below the 76th percentile for both the training and validation datasets.
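
The percentile search can be sketched as a loop over candidate cut points from the 10th to the 90th percentile that keeps the cut maximizing a chosen criterion. Heller's explained relative risk is not reimplemented here; the difference in early-WBRT proportion above versus below the cut is swapped in as a simple stand-in criterion, and the data are constructed so that the best cut is known in advance.

```python
def percentile_value(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (integer arithmetic)."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def rate_difference(above, below):
    """Stand-in criterion: early-WBRT proportion above minus below the cut."""
    return sum(above) / len(above) - sum(below) / len(below)


def best_threshold(points, early_wbrt, criterion=rate_difference):
    """Return (score, percentile, cut) for the best cut between the 10th and 90th."""
    ordered = sorted(points)
    best = None
    for pct in range(10, 91):
        cut = percentile_value(ordered, pct)
        above = [e for p, e in zip(points, early_wbrt) if p > cut]
        below = [e for p, e in zip(points, early_wbrt) if p <= cut]
        if not above or not below:
            continue  # skip cuts that leave one side empty
        score = criterion(above, below)
        if best is None or score > best[0]:
            best = (score, pct, cut)
    return best


# Constructed toy data: early WBRT occurs only at 170 nomogram points or more.
points = list(range(100, 200))
early_wbrt = [1 if p >= 170 else 0 for p in points]
print(best_threshold(points, early_wbrt))
```

Because the cut is chosen to maximize the criterion in the training data, its apparent performance there is optimistic, which is why the comparison in the validation dataset matters.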

Fig. 3 Probability of early whole brain radiotherapy for a training dataset and b initial validation dataset. The 76th percentile (182 points on the nomogram) represents the threshold at which there is maximum explained relative risk of WBRT within 3 months of radiosurgery. c Total nomogram points versus brain metastasis velocity (metastases per year on a natural logarithmic scale) for patients who experienced DBF in validation dataset. The solid black line represents a locally weighted polynomial regression fit (95% confidence interval between dashed lines). Overlaid contour lines (color) represent the relative densities of the plotted points. A statistically significant correlation was shown between total nomogram points and brain metastasis velocity (r = 0.38, p < 0.0001)

We then analyzed whether the nomogram correlated with the number of new metastases at time of DBF, expressed as the number of new brain metastases per year (brain metastasis velocity). Figure 3c graphically depicts the correlation between total nomogram points and the brain metastasis velocity (r = 0.38, p < 0.0001).
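
As a worked illustration, brain metastasis velocity can be computed as the number of new metastases divided by the time since SRS in years, and then correlated with nomogram points. The sketch assumes that the correlation is taken on the log-transformed velocity, which the logarithmic axis of Fig. 3c suggests but the text does not state. The patients are hypothetical.

```python
import math


def brain_metastasis_velocity(new_metastases, months_since_srs):
    """New brain metastases per year between initial SRS and DBF."""
    return new_metastases / (months_since_srs / 12.0)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical patients: (nomogram points, new metastases at DBF, months to DBF).
patients = [(120, 1, 12), (150, 2, 8), (170, 3, 6), (190, 6, 3), (210, 4, 4)]
points = [p for p, _, _ in patients]
log_bmv = [math.log(brain_metastasis_velocity(n, m)) for _, n, m in patients]
print(round(pearson_r(points, log_bmv), 2))
```

For example, three new metastases found 6 months after SRS correspond to a velocity of six metastases per year.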

Discussion

The present study introduces an externally validated multi-institutional nomogram for predicting when patients will develop new brain metastases following SRS monotherapy. Such a tool may be helpful in triaging patients to either SRS or WBRT, a decision with significant implications for quality of life and health economics. A recent meta-analysis by Sahgal et al. found that patients younger than 50 years with a single metastasis had a significantly lower risk of DBF than patients with 2–4 metastases [22]. However, the brain metastasis population is heterogeneous, and there is emerging evidence that using the number of metastases as the sole factor to direct clinical decision-making may be an oversimplification of the problem [11, 23, 24]. In our own comparative analysis of the nomogram versus current guidelines, the nomogram was a better predictor of DBF than the number of metastases in the validation datasets, suggesting that there is a population of patients who receive WBRT but might instead have benefited from SRS.

While DBF is only a component of what drives the clinical decision for WBRT, it is a dominant factor that leads patients to require early WBRT after SRS alone [11, 25]. Nomogram values were predictive of the number of metastases that occur at treatment failure, which is commonly the factor that practitioners use to decide whether to treat further brain metastases with WBRT or further SRS. Other factors including local failure risk, life expectancy, leptomeningeal involvement, current cognitive function and practitioner bias contribute to this decision. Despite these other factors, DBF is an endpoint that significantly affects the cost of future care as it can lead to multiple salvage treatments given its prevalence and the cost of multiple SRS procedures [1, 2]. The use of salvage WBRT after SRS ranged from 5 to 20% amongst the SRS centers in the current study, suggesting that the standards for use of WBRT vary amongst institutions. However, as nomogram values were predictive of the number of metastases that occurred at treatment failure, patients who are at highest risk of early failure are also those at risk of multifocal failure.

Our analysis identified a high-risk cohort, which demonstrated a 2.1-fold increase in the likelihood of receiving early WBRT in the validation dataset. The identification of patients who will experience rapid DBF is probably the single most clinically useful result of the nomogram. Moreover, the nomogram’s calibration is best within the first 3 months after SRS. Those patients who require WBRT within 3 months may not have been appropriately triaged to upfront SRS as rapid DBF had generally occurred. A postponement of WBRT for greater than 3 months may represent a clinically meaningful time period in the survival of a patient with brain metastases since the cognitive toxicities of WBRT are often detectable by 4 months. Given a median life expectancy of 8–11 months for brain metastasis patients, cognition could potentially have been spared for a significant portion of a patient’s life expectancy [6]. The divergence of the natural histories of many cancers over time based on responses and tolerances to systemic therapy likely explains why the nomogram is less well-calibrated at later time points when control of systemic disease affects the later DBF rate.

There are several limitations to the current study. The use of multiple retrospective datasets introduces the possibility of patient selection bias. Variations in provider treatment discretion and follow-up across institutions have likely contributed to the discrepancy between rates of events in the derivation and validation datasets. Moreover, molecular subtyping of the recorded primary malignancies was not available for most histologies in this analysis, but is suspected to contribute to the rate of DBF [26]. A preliminary analysis of the available data suggested a trend toward higher rates of DBF for triple negative breast cancer as compared to non-triple negative breast cancers in the pooled training and validation datasets. Such differences were not observed for BRAF-mutant melanoma or EGFR-mutant lung cancer, but we caution against over-interpreting this finding given the limited numbers available for analysis (data not shown).

Similarly, the contribution of improving systemic therapies continues to evolve with the continued integration of new effective therapeutic agents for extracranial disease. The era and patient population from which the nomogram was derived was witness to the development of new systemic treatments for lung cancer, breast cancer, renal cell carcinoma and melanoma. The likelihood of DBF will continue to change with the advent of newer agents that have the ability to change the likelihood of developing brain metastases. Because of this “moving target” effect, clinical tools such as the current nomogram will need periodic updating to account for the ability to control extracranial disease.

In the current era of precision medicine, it is clear that individualizing care for patients provides an improvement in care. Moreover, the use of nomograms in oncology has commonly helped to make difficult clinical decisions and properly assign resources [27]. Improvements to the current predictive nomogram in the future will likely come in the form of the integration of a systems biology component. Prospective validation of the nomogram is currently being planned at the multi-institutional level. Particular issues of interest for the prospective validation will be to better determine critical values on the nomogram that may trigger a decision for upfront WBRT and whether nomogram use mitigates the number of patients that are either over- or under-treated with SRS.