Introduction

Prognosis, together with diagnosis and treatment, is one of the three decisional processes of clinical medicine, and a fundamental element of public health as well. The prognosis of a given patient over a pre-defined time period generally relies on prognostic biomarkers and/or risk prediction rules, the latter being mathematical combinations of multiple prognostic factors (i.e. biomarkers and/or other quantitative and qualitative variables such as age and gender) that allow the probability of a specific outcome to be calculated on an individual basis. Biomarkers and risk prediction rules should be regarded as a support to clinical judgment and should not be used in isolation.

Before being adopted in daily clinical practice, a candidate risk prediction rule needs to be carefully developed in a representative cohort of patients and externally validated in an independent series of patients affected by the same disease. Finally, a randomized controlled clinical trial would ideally be needed to demonstrate that allocating patients to specific treatments according to a given risk stratification tool leads to better outcomes than those achieved in patients managed according to a different risk stratification rule.

The evaluation of the prognostic accuracy of a risk score demands the assessment of the discrimination, calibration and risk reclassification abilities of the same prediction rule. In this paper, using a simulation study, we describe how to formally calculate the Harrell’s C-index (to assess discrimination) [1], the Hosmer–May test (an index of calibration) [2] and the integrated discrimination improvement (IDI, an index of risk reclassification) [3] to evaluate the accuracy of a new prognostic scoring system (the PREDICT model) for disease-specific mortality in a cohort of patients aged > 60 years with localized prostate cancer [4]. For didactic purposes we use a small sample of patients (n = 40), but the reader should be aware that the validation of a risk prediction rule requires a sample size adequately predefined according to a specific power analysis [5]. In the same example, we also describe how to compare the accuracy of PREDICT with that provided by the score proposed by Ash et al. [6]. In the Online Appendix, we also provide the clinical dataset used to explain the above-mentioned concepts and the STATA commands, so that interested readers can easily reproduce the data analysis.

Indexes of prognostic accuracy

Discrimination measures how well a predictive model distinguishes patients who experience a given end point (for example, mortality) from those who do not. Discrimination, as assessed by the Harrell’s C-index [1], ranges from 0.5 (no discrimination) to 1.0 (perfect discrimination). The higher the Harrell’s C-index, the higher the accuracy of the model in predicting the study outcome. The interpretation of the Harrell’s C-index is conceptually similar to that of a ROC curve analysis [7]. Briefly, the Harrell’s C-index represents the proportion of all possible pairs of patients in which the risk of death as estimated by the model agrees with the observed outcome. The concept underlying this index is that, under the assumption of random sampling, the predicted probabilities in patients who experience a given outcome should be systematically higher than those in patients who do not (tied observations are excluded).

Calibration measures how closely the outcome probability estimated by a predictive model matches the “real” probability of the same outcome. In calibration analysis, predicted and observed outcome probabilities are compared by the Hosmer–May test. When not significant, this test provides statistical evidence that predicted and observed outcome probabilities do not differ, implying that the model is calibrated. To understand the difference between discrimination and calibration, let us consider a hypothetical set of patients followed up for 20 years with an observed mortality of 40%. A prognostic model including clinical and laboratory variables which provides a 40% probability of death at 20 years in this cohort is perfectly calibrated (estimated and observed death probabilities coincide). Now let us suppose that another prognostic model (i.e. a new risk prediction rule) provides a 25% chance of death in all survivors and a 26% chance in all non-survivors. A threshold of 25% (as derived from the new prognostic model) accurately discriminates patients who die from those who survive, but the model is not calibrated because the estimated probability of mortality (25–26%) is much lower than the observed death probability (40%). Thus, the model is “discriminating” but underestimates the true risk of mortality.

IDI is an index of risk reclassification which quantifies whether a new variable provides a clinically relevant improvement in prediction over and above that provided by a model based on a previous risk prediction rule that does not include the same variable. The methodological rationale of IDI is that an adequate prognostic variable will increase the estimated risk for patients who experience the outcome of interest and decrease the estimated risk for patients who do not.
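To make the distinction between discrimination and calibration concrete, the following minimal sketch reproduces the hypothetical 25%/26% example in Python with the lifelines package (the Online Appendix uses STATA); all data values and variable names are invented purely for illustration.

```python
import numpy as np
from lifelines.utils import concordance_index

# Hypothetical cohort: 10 patients followed for up to 20 years, 4 deaths (40%).
time = np.array([3, 6, 9, 12, 20, 20, 20, 20, 20, 20])   # years of follow-up
died = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])          # 1 = death, 0 = censored

# The "new" model of the example: 26% predicted death risk for every
# non-survivor and 25% for every survivor.
pred = np.where(died == 1, 0.26, 0.25)

# Discrimination: concordance_index expects scores where higher values mean
# longer survival, so the predicted risk is negated. The result is well above
# 0.5 (tied predictions among the deaths are counted as one half).
print("C-index:", concordance_index(time, -pred, event_observed=died))

# Crude calibration check (not the formal Hosmer-May test): the mean predicted
# risk (~25%) is far below the observed 20-year mortality (40%).
print("mean predicted:", round(pred.mean(), 3), "observed:", died.mean())
```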

The prognostic accuracy of the PREDICT in the study cohort

PREDICT (pre-treatment estimation of the risk of death in cancer of the prostate) is a risk prediction rule for disease-specific mortality developed by Kerkmeijer et al. [4] in 3383 patients with localized prostate cancer enrolled between 1989 and 2008 in four university hospitals in the Netherlands and Belgium. This risk score was specifically developed to help clinicians in therapeutic decision-making. Kerkmeijer et al. identified, by multivariable Cox regression analysis, clinical T stage, biopsy grade, PSA and age as correlates of mortality. The PREDICT score provides five prognostic strata: low risk (score < 3), intermediate risk (score 3 to 7), moderate risk (score 8 to 11), high risk (score 12 to 16) and very high risk (score > 16). We tested the prognostic accuracy of PREDICT as a categorical variable in a series of 40 patients aged > 60 years affected by localized prostate cancer (see Table S1 in the Online Appendix) and followed up for a median time of 15 years (range 1–26 years). Over a total person-time of 552 years, 24 patients died. According to PREDICT, at baseline 12 patients (30%) were classified at low risk, 9 patients (22.5%) at intermediate risk, 11 patients (27.5%) at moderate risk, and the remaining 8 patients (20%) at high risk. No patient was classified at very high risk. Kaplan–Meier survival analysis (Fig. 1) shows that cumulative survival decreases progressively from the low risk category onwards and that the difference in survival among the four groups is highly statistically significant (log rank test = 29.3; P < 0.001). Furthermore, in a Cox model a 1-category increase in PREDICT entailed a 3.8-fold increase (95% CI 2.2–6.5, P < 0.001) in the hazard of death (Table 1, basic model). Although both the Kaplan–Meier and Cox regression analyses consistently indicate a significant and direct association between PREDICT and the hazard of mortality, they do not provide quantitative information about the discriminatory power of this risk prediction rule for death in the study cohort. Therefore, to assess the discriminatory power of PREDICT we calculated the Harrell’s C-index and found a value of 84.3%, a figure of high statistical significance (P < 0.001). Furthermore, the Hosmer–May test indicates that the prognostic estimates derived from PREDICT are calibrated because observed and predicted outcomes do not differ (Hosmer–May Wald Chi Squared = 2.28, P = 0.32). Overall, in this example of 40 patients aged > 60 years with localized prostate cancer, PREDICT accurately discriminates patients who died from those who survived and provides calibrated risk estimates.
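As an illustrative sketch of how these analyses can be run (the Online Appendix provides the corresponding STATA commands), the following Python code using the lifelines package fits the log-rank comparison and the basic Cox model; the file name and column names ('years', 'died', 'predict_cat') are hypothetical stand-ins for the dataset in Table S1.

```python
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import multivariate_logrank_test

# Hypothetical dataset: follow-up time in years, vital status (1 = death) and
# the PREDICT risk category coded 1 (low) to 4 (high).
df = pd.read_csv("prostate_cohort.csv")

# Log-rank test across the four PREDICT categories (as in Fig. 1).
lr = multivariate_logrank_test(df["years"], df["predict_cat"], df["died"])
print(f"log-rank chi2 = {lr.test_statistic:.1f}, P = {lr.p_value:.4f}")

# Basic Cox model: hazard ratio per 1-category increase in PREDICT and
# Harrell's C-index as a measure of discrimination.
basic = CoxPHFitter().fit(df[["years", "died", "predict_cat"]],
                          duration_col="years", event_col="died")
print(basic.summary[["exp(coef)", "exp(coef) lower 95%",
                     "exp(coef) upper 95%", "p"]])
print("Harrell's C-index:", round(basic.concordance_index_, 3))
```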

Fig. 1

Kaplan–Meier survival curves according to PREDICT categories

Table 1 Univariate and multiple Cox regression analyses for all-cause mortality of PREDICT either unadjusted (basic model) or adjusted (expanded model) for sedentary lifestyle

Does sedentary lifestyle increase the prognostic accuracy of PREDICT for mortality in the study cohort?

A fundamental issue in prognosis is whether a new variable increases the accuracy of a previous risk prediction rule in predicting the occurrence of a given outcome. In our study cohort, in a Cox regression model, sedentary lifestyle is significantly related to the risk of mortality (hazard ratio 2.8, 95% CI 1.2–6.5, P = 0.02). Because the PREDICT score developed by Kerkmeijer et al. does not include sedentary lifestyle, we now want to quantify to what extent taking this variable into account could increase the prognostic power of PREDICT for mortality in the study cohort. From a practical point of view, we should compare two prognostic models: a model based on PREDICT (basic model) and a model including PREDICT and sedentary lifestyle (expanded model). The two models are “nested” because the basic model (based on PREDICT alone) is a simplified version of the expanded model (PREDICT + sedentary lifestyle).

First, we build a Cox model including the two variables (see Table 1, expanded model) and find that both PREDICT and sedentary lifestyle are significantly related to the hazard of mortality. This implies that sedentary lifestyle is associated with the mortality risk in the study cohort independently of PREDICT, and vice versa. However, even though sedentary lifestyle is statistically significant in a Cox model, we cannot take for granted that it adds meaningful prognostic information to PREDICT in predicting mortality in the study cohort.
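Continuing the hypothetical lifelines sketch from above (with an additional 0/1 'sedentary' column, again an invented name), the two nested models can be fitted and compared with a likelihood-ratio test on 1 degree of freedom:

```python
from scipy import stats
from lifelines import CoxPHFitter

# Expanded model: PREDICT category plus the sedentary lifestyle indicator.
expanded = CoxPHFitter().fit(df[["years", "died", "predict_cat", "sedentary"]],
                             duration_col="years", event_col="died")

# Likelihood-ratio (goodness-of-fit) test between the nested basic and
# expanded Cox models; they differ by one parameter, hence 1 degree of freedom.
lr_chi2 = 2 * (expanded.log_likelihood_ - basic.log_likelihood_)
p_value = stats.chi2.sf(lr_chi2, df=1)
print(f"LR chi2 = {lr_chi2:.2f}, P = {p_value:.3f}")

# Gain in discrimination after adding sedentary lifestyle.
print("C basic:", round(basic.concordance_index_, 3),
      "C expanded:", round(expanded.concordance_index_, 3))
```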

As shown in Fig. 2, the inclusion of sedentary lifestyle into the model based on PREDICT marginally increased discrimination (Harrell’s C-index from 84.3 to 86.1%, a + 1.8% gain), improved the goodness of fit (Chi Square = 4.6 with 1 degree of freedom, P = 0.032) and provided a small, non-significant integrated discrimination improvement (IDI + 4%, P = 0.22). The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit (such as the Chi Square) typically summarize the agreement between observed and expected values, i.e. values estimated by the model.
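The IDI itself can be sketched along the same lines: using the hypothetical basic and expanded models fitted above, the predicted death probabilities at a fixed horizon (15 years is assumed here purely for illustration) are averaged separately in patients who died and in those who survived; a formal analysis would also derive a confidence interval and P value, typically by bootstrapping.

```python
import numpy as np

t_star = 15  # hypothetical time horizon in years

def death_prob(model, covariates, t):
    """Model-predicted probability of death by time t for each patient."""
    surv = model.predict_survival_function(covariates, times=[t])
    return 1.0 - surv.iloc[0].values

p_basic = death_prob(basic, df[["predict_cat"]], t_star)
p_expanded = death_prob(expanded, df[["predict_cat", "sedentary"]], t_star)

events = df["died"].values == 1

# IDI: improvement in the mean predicted risk difference between patients
# with and without the event when moving from the basic to the expanded model.
idi = ((p_expanded[events].mean() - p_expanded[~events].mean())
       - (p_basic[events].mean() - p_basic[~events].mean()))
print(f"IDI at {t_star} years: {idi:.3f}")
```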

Fig. 2

Discrimination and calibration abilities of Cox models including PREDICT alone and PREDICT + sedentary lifestyle (see text for more details)

Calibration did not materially change (Hosmer–May Wald Chi Square = 1.86, P = 0.39, see Fig. 2) after the inclusion of sedentary lifestyle into the model. Overall, these results indicate that sedentary lifestyle does not materially increase the prognostic accuracy for mortality of PREDICT in the study cohort.

Comparison of the prognostic power of PREDICT and Ash classification for predicting mortality in the study cohort

The question now is whether the prognostic information provided by PREDICT is comparable to that provided by an existing risk prediction rule, the Ash classification [6]. In a Cox regression model, the Ash classification was directly and significantly related to the incidence of mortality in the study cohort [HR (1-category increase): 3.1, 95% CI 1.5–6.4, P = 0.002].

When comparing the model based on PREDICT with that based on the Ash classification, we deal with two prognostic models which are not nested, i.e. neither can be considered a simplified version of the other, as was the case in the previous example. The appropriate approach in this instance is the Akaike Information Criterion (AIC), which does not require the nested assumption. Because our example is based on a small sample size, we need to apply a small-sample correction to the AIC (AICc) which takes into account the number of patients and the degrees of freedom. The lower the AIC or AICc, the higher the prognostic accuracy of a predictive model. Because the AIC is a “criterion” and not a “test”, its calculation does not provide a P value.

To assess which of the two candidate models (the PREDICT-based and the Ash classification-based models) is the best prognostic model, we analyze the delta AICc and the Akaike weights. To calculate the delta AICc we first identify the model with the minimum AICc and then subtract this value from the AICc values of the remaining models. The delta AICc is used to derive the Akaike weights, i.e. the probability that a given model is the best one among the full set of candidate models. In our example, the model based on PREDICT has a > 99% probability of being the best model. Accordingly, both the discrimination (84.3%) and calibration (Hosmer–May Wald Chi Square = 2.28, P = 0.32) provided by PREDICT are better than those of the Ash classification (discrimination: 73%; calibration: Hosmer–May Wald Chi Square = 1.02, P = 0.60).
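As a final sketch under the same hypothetical setup (with 'ash_cat' as an invented column name for the Ash classification), the AICc, delta AICc and Akaike weights of the two non-nested Cox models can be computed from the partial-likelihood AIC reported by lifelines:

```python
import numpy as np
from lifelines import CoxPHFitter

# Cox model based on the Ash classification (hypothetical 'ash_cat' column).
m_ash = CoxPHFitter().fit(df[["years", "died", "ash_cat"]],
                          duration_col="years", event_col="died")

def aicc(model, n):
    """Small-sample corrected AIC: AICc = AIC + 2k(k + 1) / (n - k - 1)."""
    k = len(model.params_)            # number of estimated parameters
    return model.AIC_partial_ + 2 * k * (k + 1) / (n - k - 1)

n = len(df)
aiccs = np.array([aicc(basic, n), aicc(m_ash, n)])   # PREDICT vs Ash models

# Delta AICc: difference from the smallest AICc in the candidate set.
delta = aiccs - aiccs.min()

# Akaike weights: probability that each model is the best of the candidate set.
weights = np.exp(-delta / 2) / np.exp(-delta / 2).sum()
print(dict(zip(["PREDICT", "Ash"], weights.round(3))))
```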

Conclusions

Risk prediction rules are tools to predict the future course of a disease and to help clinicians provide a prognosis or decide whether to start or intensify a given therapy on an individual basis. Before being implemented in clinical practice, risk prediction rules should be carefully developed and externally validated by appropriate statistical methods such as measures of discrimination, calibration and risk reclassification. Finally, researchers should provide robust evidence, in the framework of a randomized clinical trial, that the use of a new risk prediction rule leads to better clinical outcomes than those achieved by the adoption of an existing risk prediction model or standard clinical care.