
15.1 Overall Performance Measures

The distance between the predicted outcome and the actual outcome is central to quantifying overall model performance from a statistical perspective.181 For continuous outcomes, the distance is \(Y - \hat Y\). For binary outcomes, \(\hat Y\) is the predicted probability p, and for survival outcomes it is the predicted time to an event. These distances between observed and predicted outcomes are related to the concept of “goodness-of-fit” of a model, with better models having smaller distances between predicted and observed outcomes.

15.1.1 Explained Variation: R2

The amount of explained variation (R 2) is an overall measure to quantify the amount of information in a model in a given data set. R 2 is useful to guide various model development steps for all types of regression models commonly used in prognostic research, including linear and generalized linear models (e.g. logistic, Cox). With R 2, we can readily compare the impact of different encodings of predictors, of different shapes of the relationship of continuous predictors to the outcome, of different selections of predictors, and of including interaction terms (see previous chapters).

R 2 is the most common performance measure for continuous outcomes. For generalized linear models, Nagelkerke’s R 2 can well be used.309 As discussed in Chap. 4, it is based on a logarithmic scoring rule: (1 − Y) × log(1 − p) + Y × log(p), where the logarithm of the prediction p is compared with the actual outcome Y. For binary outcomes, the log likelihood contribution for a patient with the outcome is log(p), and without the outcome log(1 − p). When a very low prediction is made for a patient who actually had the outcome, this prediction receives a severe penalty (Fig. 15.1). This may be a disadvantage for a prediction model that gives a prediction of 0 or 1 while the actual outcome is discordant.

Fig. 15.1
figure 1

Logarithmic and quadratic error scores of a subject with (y = 1) or without (y = 0) the outcome in relation to the predicted probability (p). The logarithmic score was calculated as y × log(p) + (1 − y) × log(1 − p), as in Nagelkerke's R 2 (solid line). The quadratic score was calculated as (y − p)^2, as in the Brier score (dashed line). Lines were scaled such that they crossed at p = 50%. We note that the logarithmic score severely penalizes false predictions close to 0 or 100%

15.1.2 Brier Score

An alternative for binary outcomes is to use a quadratic scoring rule, where the squared differences between actual outcomes Y and predictions p are calculated. This calculation is done in the Brier score, which is simply defined as (Y − p)^2. We can also write this similarly to the logarithmic score: Y × (1 − p)^2 + (1 − Y) × p^2, with Y the outcome and p the prediction for each subject. For a subject, the score can range from 0 (prediction and outcome equal) to 1 (discordant prediction); a prediction of 50% has a score of 0.25 whether the outcome is 0 or 1. The Brier score is hence less severe than Nagelkerke's R 2 in penalizing false predictions close to 0% or 100% (Fig. 15.1). The Brier score for a model can range from 0 for a perfect model to 0.25 for a non-informative model with a 50% incidence of the outcome. When the incidence is lower, the maximum score for a model is lower, e.g. for an incidence of 10%: 0.1 × (1 − 0.1)^2 + (1 − 0.1) × 0.1^2 = 0.090. A disadvantage of the Brier score is hence that its interpretation depends on the incidence of the outcome.

Similar to Nagelkerke's approach to the LR statistic, we could scale the Brier score by its maximum: Brierscaled = 1 − Brier/Briermax, where Briermax = mean(p) × (1 − mean(p))^2 + (1 − mean(p)) × mean(p)^2, with mean(p) indicating the average probability of the outcome. Brierscaled ranges between 0% and 100%.
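As a minimal sketch of these calculations, the Brier score and its scaled version can be computed directly from a vector of predicted probabilities and observed outcomes (hypothetical p and y vectors, not the case study data):

    # p: predicted probabilities; y: observed 0/1 outcomes (hypothetical example data)
    p <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
    y <- c(1,   1,   0,   1,   0,   0)

    brier        <- mean((y - p)^2)      # quadratic score, averaged over subjects
    p.mean       <- mean(p)              # average predicted probability of the outcome
    brier.max    <- p.mean * (1 - p.mean)^2 + (1 - p.mean) * p.mean^2  # non-informative model
    brier.scaled <- 1 - brier / brier.max  # 0% = non-informative, 100% = perfect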

15.1.3 Example: Performance of Testicular Cancer Prediction Model

We consider a development sample containing 544 patients contributed by six study groups,417 and a validation sample of 273 patients treated at Indiana University Medical Centre.466 We developed a logistic regression model with five predictors: teratoma elements in the primary tumor, pre-chemotherapy levels of AFP and HCG, post-chemotherapy mass size, and reduction in mass size.

Internal validation of performance was estimated with bootstrapping (200 replications). Bootstrap samples were created by drawing random samples with replacement from the development sample. The prediction model was fitted in each bootstrap sample and tested on the original sample.

The essential R code is:

  • # 5 predictors in data set n544; develop testicular cancer model

  • full <- lrm(NEC ~ TER+PREAFP+PREHCG+SQPOST+REDUC10, data=n544, x=TRUE, y=TRUE)

  • val.prob(logit=full$linear.predictors, y=full$y) # apparent performance

  • validate(full, B=200) # internal validation with 200 bootstraps

  • # External validation; refit model for matrix x and comparison of coefs

  • ext.full <- lrm(NEC ~ TER+PREAFP+PREHCG+SQPOST+REDUC10, data=val, x=TRUE, y=TRUE)

  • lp <- ext.full$x %*% full$coef[2:length(full$coef)] + full$coef[1]

  • val.prob(logit=lp, y=ext.full$y, riskdist="predicted") # external validation

Nagelkerke's R 2 was 38.9% in the development sample, and slightly lower at internal validation (Table 15.1). At external validation, the R 2 was estimated considerably lower, at 26.7%. Note that R 2 is based on the difference between a null model (“intercept only”) and a model with recalibrated predictions (intercept + calibration slope × logit of predictions).174 So, the R 2 is estimated after recalibration of the predictions.

Table 15.1 Overall performance of testicular cancer prediction model

The Brier score was 0.174 and 0.178 at development and internal validation, respectively. Remarkably, the Brier score was better at external validation (0.161). The external Brier score was simply calculated by comparing predictions with actual outcomes, without the recalibration that was done for R 2. The interpretation of the Brier score is easier with the scaled version, which compensates for the fact that the maximum Brier score was lower in the external validation set (necrosis in 76 of 273 (28%); Briermax, 0.20) than in the development set (necrosis in 245 of 544 (45%); Briermax, 0.25). The scaled Brier score was clearly lower at external validation than at internal validation (20% vs. 28%, Table 15.1).

Table 15.2 Classification of subjects according to a cutoff for the probability of an outcome (event or no event)

15.1.4 Overall Performance Measures in Survival

Nagelkerke's R 2 can readily be calculated for survival outcomes, based on the difference in −2 log likelihood of a model without and a model with the linear predictor. Calculation of the Brier score is not directly possible because of censoring: Not all subjects are followed long enough for the outcome to occur. To address the censoring issue, we can define a weight function, which considers the conditional probability of being uncensored over time.146,375,374 The assumption is that the censoring mechanism is independent of survival and of the subject's history. We can hence calculate the Brier score at fixed time points. For example, we can compare predicted vs. observed survival at 1, 2, and 5 years of follow-up. Choosing many consecutive time points leads to a time-dependent graph. A useful benchmark curve is based on the Brier score for the overall Kaplan-Meier estimator, which does not consider any predictive information. The survival estimates of the overall Kaplan-Meier curve depend only on the time of follow-up, and are identical for all subjects alive at a certain point in time. An interesting example is provided by a case study on the disappointing contribution of microarray data to prediction of survival for patients with diffuse large-B-cell lymphoma.374
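A minimal sketch of this weighting approach (inverse probability of censoring weighting) is given below, assuming hypothetical vectors time, status (1 = event), and surv.prob (model-based survival probabilities at the horizon t.star); dedicated packages implement this more robustly:

    library(survival)

    t.star <- 2  # e.g. evaluate predicted survival at 2 years of follow-up

    # Kaplan-Meier estimate of the censoring distribution G(t) ("reverse" Kaplan-Meier)
    G.fit  <- survfit(Surv(time, 1 - status) ~ 1)
    G.step <- stepfun(G.fit$time, c(1, G.fit$surv))   # G(t) as a step function of time

    # Weights: events before t.star get 1/G(T_i); subjects still at risk at t.star
    # get 1/G(t.star); subjects censored before t.star contribute 0
    w <- ifelse(time <= t.star & status == 1, 1 / G.step(time),
         ifelse(time >  t.star,               1 / G.step(t.star), 0))

    # Event-free status at t.star, compared with the predicted survival probabilities
    y.tstar     <- as.numeric(time > t.star)
    brier.tstar <- mean(w * (y.tstar - surv.prob)^2)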

15.1.5 Decomposition in Discrimination and Calibration

Overall statistical performance measures incorporate both calibration and discrimination aspects. For example, the Brier score can formally be decomposed into indicators of calibration and discrimination.303,38 Discrimination relates to how well a prediction model can discriminate those with the outcome from those without the outcome. Calibration relates to the agreement between observed outcomes and predictions. Studying discriminative ability and calibration is often more meaningful than an overall measure such as R 2 or Brier score when we want to appreciate the quality of model predictions for individuals. We therefore discuss these aspects further.

15.1.6 Summary Points

  • R 2 is a common measure to express the amount of variability in outcomes that is explained by the prediction model

  • The Brier score is another common performance measure for the distance between observed and predicted outcomes, which can be decomposed into discrimination and calibration aspects

15.2 Discriminative Ability

Model predictions need to discriminate between those with and those without the outcome (Event vs. No event). Several measures can be used to indicate how well we classify patients in a binary prediction problem. The concordance (c) statistic is the most commonly used performance measure to indicate the discriminative ability of generalized linear regression models. For a binary outcome, c is identical to the area under the receiver operating characteristic (ROC) curve. The ROC curve is a plot of sensitivity (true-positive rate) against 1 − specificity (false-positive rate) for consecutive cutoffs for the probability of an outcome. We therefore consider sensitivity and specificity first.

15.2.1 Sensitivity and Specificity of Prediction Models

Sensitivity is defined as the fraction of true-positive (TP) classifications among the total number of patients with the outcome, and specificity as the fraction of true-negative (TN) classifications among the total number of patients without the outcome (Table 15.2). To classify a patient as positive or negative, we apply a cutoff to the predicted probability: if the prediction is higher than the cutoff, the patient is classified as positive, otherwise as negative. It is common to use a cutoff of 50% for classification. This cutoff is often not defendable in a medical context, as we will discuss in detail in the next chapter (Chap. 16). We can examine sensitivity and specificity over the whole range of cutoffs from 0% to 100%. The results can be plotted in an ROC curve.172
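As a simple sketch, sensitivity and specificity at one or more cutoffs can be computed directly from hypothetical vectors p (predicted probabilities) and y (0/1 outcomes):

    # Sensitivity and specificity of a prediction model at a given cutoff
    sens.spec <- function(p, y, cutoff) {
      pos <- p > cutoff                    # classify as positive above the cutoff
      c(sensitivity = sum(pos & y == 1) / sum(y == 1),
        specificity = sum(!pos & y == 0) / sum(y == 0))
    }

    # Examine the trade-off over a range of cutoffs
    sapply(c(0.3, 0.5, 0.7), function(cut) sens.spec(p, y, cut))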

15.2.2 Example: Sensitivity and Specificity of Testicular Cancer Prediction Model

If we classify patients as having necrosis when the probability of necrosis is over 50%, we have a sensitivity of 68% and a specificity of 77% (FP rate, 23%). With a higher cut-off, for example 70%, these numbers are 42% and 92%, respectively. This illustrates that a higher cutoff leads to better specificity, at the price of a lower sensitivity. This trade-off is visualized in an ROC curve (Fig. 15.2).

Fig. 15.2
figure 2

Receiver operating characteristic (ROC) curve for the testicular cancer model in the development data set of 544 patients. Using cutoffs for the predicted probability of necrosis (benign tissue) results in specific combinations of true-positive rate (sensitivity) and false-positive rate (1 – specificity). The area under the curve is 0.818

15.2.3 ROC Curve

A plot of an ROC curve has often been used in diagnostic research to quantify the diagnostic value of a test over its whole range of possible cutoffs for classifying patients as positive vs. negative. We can also make an ROC curve with consecutive cutoffs for the predicted probability of a binary outcome. We start with a cutoff of 0%, which implies that all subjects are classified as positive. The sensitivity is 100%, and the specificity 0% (upper-right point in Fig. 15.2). There are no false-negative classifications, and 100% false-positive classifications, since all subjects without the outcome are classified as positive. We then shift to a slightly higher cutoff, e.g. 1%, where the sensitivity may still be 100%, but the specificity rises above 0%. We follow all possible cutoffs up to 100%, where all subjects are classified as negative. This is the lower-left point in Fig. 15.2. The sensitivity is then 0%, and the specificity 100%. The curves lie more towards the upper-left corner when the distributions of predictions are more separated between those with and without the outcome (Fig. 15.3).

Fig. 15.3
figure 3

ROC plot for five hypothetical prediction models. Models were created with distributions as shown in Fig. 15.4 (see also Fig. 4.6). The c statistics were 0.5, 0.6, 0.64, 0.7, 0.83, and 0.98 at 50% incidence of the outcome

We can draw a line between the (0%, 0%) and (100%, 100%) points, indicating a non-informative model. Note that the sum of sensitivity and specificity is 1 at every cutoff for such a model. For sensible prediction models this sum is larger than 1 (equivalently, Youden's index, defined as sensitivity + specificity − 1, is positive).

The area under the curve can be interpreted as the probability that a patient with the outcome is given a higher probability of the outcome by the model than a randomly chosen patient without the outcome.172 An uninformative model, such as a coin flip, will hence have an area of 0.5. A perfect model has an area of 1. The interpretation hence is relatively straightforward, but assumes that we have a pair of patients, one with and one without the outcome. This is a rather artificial situation. Statistically, this conditioning on a pair of patients is attractive, since it makes the area independent of the incidence of the outcome, in contrast to R 2 or the Brier score for example.

A generalization of the area under the ROC curve is provided by the concordance statistic (c).175 The c statistic is a rank-order statistic for predictions against true outcomes, related to Somers' D statistic. As a rank-order statistic, it is insensitive to errors in calibration such as differences in the average outcome. For binary outcomes, c is identical to the area under the ROC curve.
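To illustrate this rank-order interpretation, a sketch of the calculation over all pairs of subjects with and without the outcome is given below (hypothetical p and y vectors); the somers2 function in the Hmisc package gives equivalent results.

    # c statistic as the proportion of concordant pairs (ties count as 1/2)
    c.statistic <- function(p, y) {
      p1 <- p[y == 1]; p0 <- p[y == 0]
      comp <- outer(p1, p0, "-")           # all pairwise differences in predictions
      (sum(comp > 0) + 0.5 * sum(comp == 0)) / length(comp)
    }

    # library(Hmisc); somers2(p, y)        # returns C and Dxy, with c = Dxy/2 + 0.5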

Confidence intervals for the area under ROC curve (or c statistic) can be calculated with various methods. Standard asymptotic methods may be problematic, especially when sensitivity or specificity are close to 0% or 100%.9 Bootstrap resampling is a good choice for many situations. For example, differences in c between models fitted on the same data can be tested with standard formulas for the difference. But such formulas are only valid if the models were pre-specified. If one or both models were estimated on the same data, bootstrapping can be used for comparison of optimism-corrected estimates (see Chap. 17).

15.2.4 R2 vs. c

We compare the behavior of Nagelkerke’s R 2 and the c statistic in some simulations over a range of incidences of the outcome (1%, 10%, 50%, 90%, Fig. 15.4). At 50% incidence, a high c statistic such as 0.98 is associated with an R 2 value of 87%. With lower incidence, R 2 is somewhat lower.

Fig. 15.4
figure 4

Distribution of observed outcomes (0 or 1), in relation to predicted probabilities from hypothetical logistic models relating Y to a predictor X. The top six graphs relate to an incidence of 50%. The next sets of 3 × 6 graphs relate to incidences of 1%, 10%, and 90% respectively. For each hypothetical model, Nagelkerke’s R 2 and c statistic are listed. If c=0.5 (and R 2=0%), predictions are at the incidence of the outcome for all subjects, with or without the outcome, indicated with a single spike. If c is close to 1 (R 2 close to 100%), predictions are close to 0% for those without the outcome, and close to 100% for those with the outcome. Note that R 2 and c statistics differ somewhat between 10% and 90% incidence, because of random noise in the simulation procedure

15.2.5 Box Plots and Discrimination Slope

The discrimination slope has been proposed as a simple measure for how well subjects with and without the outcome are separated. It is easily calculated as the absolute difference in average predictions for those with and without the outcome.

Visualization is readily possible with a box plot (Figs. 15.5 and 15.7). The box plot may be a simple and intuitive way to communicate the extent of risk differentiation achieved by the model. The same information can be shown by histograms, which show less overlap between those with and those without the outcome for a better discriminating model (Fig. 15.4). As in Fig. 15.4, the incidence of the outcome determines the visual impression that a box plot makes, and the magnitude of the discrimination slope. With a low incidence, the slope is somewhat lower for the same c statistic.

Fig. 15.5
figure 5

Box plots for predictions from six hypothetical prediction models with different discriminative ability (see Fig 15.4). The discrimination slopes are calculated as the difference in means of predictions for those with and those without the outcome (mean incidence, 50%)


Fig. 15.6
figure 6

Lorenz curve showing proportion with the outcome vs. the cumulative proportion of patients according to rank order of predictions, for an outcome incidence of 50%. We note that a near perfect model (c=0.98) follows a horizontal line and then rises steeply to 100% false-negative rate from the point of 50% cumulative proportion.

Fig. 15.7
figure 7

Box plot showing predictions by actual outcome (necrosis) for testicular cancer patients (n=544 and 273, respectively)

15.2.6 Lorenz Curve

An alternative way to judge discriminative ability is the Lorenz curve (Fig. 15.6). The Lorenz curve has been used in economics to characterize the distribution of wealth in a population.267 This curve plots the cumulative distribution of wealth against the cumulative distribution of the population, ranked on the basis of individual wealth.

For prediction models, we can plot the cumulative proportion of the population on the x axis, ranked by predicted probability. On the y axis, we plot the cumulative proportion of subjects with the outcome. For example, we can show the proportion of subjects developing cancer against the cumulative proportion of the population ranked by cancer risk.31 In terms of ROC curves, we plot the cumulative rate of false-negative classifications against the total number of negative predictions. With incidences of the outcome around 50%, the ROC and Lorenz curves look very similar, except that the Lorenz curve is flipped vertically and horizontally. For a non-informative model, a straight line arises, since every fraction of the population classified as negative corresponds to the same fraction classified as negative among those with the outcome. A good model has a curve under this straight line, with a relatively large proportion of the population classified as negative containing only a small part of the outcomes (a low false-negative rate). At the upper end of the x axis, a small part of the population should contain many subjects with the outcome. In the ideal case, a cutoff classifies a fraction of the population equal to the prevalence as positive, and all of these have the outcome. Indeed, we note that a c statistic of 0.98 leads to a nearly horizontal line until the 50% cumulative proportion point on the x axis, and increases more or less linearly to 100% after that.

The Gini index is often calculated as a summary measure for the Lorenz curve. The Gini index is the ratio of the area (A) between the Lorenz curve of the prediction model and the line for a non-informative model, to the area under the line for a non-informative model (0.5). Hence, G = 2A.
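A sketch of this calculation from hypothetical p and y vectors: the Lorenz coordinates are obtained by ranking subjects on their predictions, and the Gini index follows from trapezoidal integration of the area between the curve and the diagonal.

    # Lorenz curve: cumulative proportion of the population ranked by predicted
    # probability (x) vs. cumulative proportion of subjects with the outcome (y)
    ord   <- order(p)
    x.cum <- c(0, seq_along(p) / length(p))
    y.cum <- c(0, cumsum(y[ord]) / sum(y))
    plot(x.cum, y.cum, type="l"); abline(a=0, b=1)   # non-informative reference line

    # Gini index: twice the area A between the diagonal and the Lorenz curve
    auc.lorenz <- sum(diff(x.cum) * (head(y.cum, -1) + tail(y.cum, -1)) / 2)
    gini       <- 2 * (0.5 - auc.lorenz)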

Table 15.3 Summary of some measures for discriminative ability of a prediction model for binary outcomes

Other summaries are related to quantiles of the cumulative distribution. For example, we can consider the number of missed outcomes when 25% of the population is classified as negative. If we want to be sure not to miss the outcome, usually only a few subjects can be classified as negative, unless a model with very good discriminative ability is used. At the upper end of the range, we can consider how many outcomes are concentrated in the upper quartile (above the 75th percentile). We will illustrate these percentiles for the testicular cancer prediction case study (Fig. 15.8).

Fig. 15.8
figure 8

Lorenz curves for prediction of necrosis vs. residual tumor. Patients classified as necrosis would not undergo surgical resection (x axis). With increasing fractions not undergoing resection, the fraction with unresected tumor increases (“missed tumor”). With 75% undergoing resection, 56% of the tumors are resected, leaving 44% unresected

An advantage of the Lorenz concentration curve is that it clearly visualizes the trade-off between how many subjects can be classified as negative and how many with the outcome are missed. A disadvantage is that the appearance of the Lorenz curve depends strongly on the incidence of the outcome; with a low incidence, the graph looks impressive, and with a high incidence, the graph looks rather poor. As an example, consider a screening setting with 1% of subjects having the disease of interest. Only a few cases with disease are missed at 25% classified as negative when we use a model with a c statistic of 0.83, and the upper quartile then easily contains most cases. With a more frequent outcome, more cases are missed at the point of 25% classified as negative, and fewer of the cases are concentrated above the 75th percentile.

15.2.7 Discrimination in Survival Data

For survival data, Harrell’s overall c statistic indicates the proportion of all orderable pairs of subjects in which the subject with the higher predicted survival is the one who survived longer.175 Ordering is possible if both subjects have an observed survival time, or when one has the outcome and a shorter survival time than the censored survival time of the other subject. Ordering is not possible if both subjects are censored, or if one has the outcome with a survival time longer than the censored survival time of the other subject. Some alternative definitions of c have been proposed, which lead to time-dependent performance curves.183
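A sketch of this calculation with the Hmisc package, assuming a hypothetical fitted Cox model cox.fit and a validation data set val with time and status variables:

    library(survival)
    library(Hmisc)

    # Linear predictor of the previously fitted Cox model in the new data
    lp <- predict(cox.fit, newdata=val, type="lp")

    # Harrell's c; the minus sign is needed because a higher linear predictor
    # implies a higher hazard and thus shorter expected survival
    rcorr.cens(-lp, Surv(val$time, val$status))["C Index"]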

In oncology, prognostic groups are often created after constructing a prognostic model. A common procedure is to base these groups on quartiles of predicted survival; the lower 25% should have the worst survival and the highest 25% the best survival. This approach can well illustrate the discriminative ability of a model. An example is shown in Chap. 23 (Fig. 23.8).

15.2.8 Example: Discrimination of Testicular Cancer Prediction Model

We continue the example of predicting a benign histology in testicular cancer patients after chemotherapy. The c statistic was 0.818 at model development, with small optimism according to bootstrap validation (decrease by 0.006 to 0.812). At external validation, the c statistic was 0.785, with a relatively wide 95% confidence interval of 0.73 to 0.84 (Table 15.4).

Table 15.4 Discriminative ability of testicular cancer prediction model

The discrimination slope was 0.30 at model development, with small optimism according to bootstrap validation (decrease to 0.29). At external validation, the slope was much smaller (0.24). Part of this decrease is attributable to the lower average prevalence of necrosis (76 of 273, 28%, vs. 245 of 544, 45%). This lower prevalence is also evident from the box plots (Fig. 15.7).

The Lorenz curves were created with x axis as the cumulative fraction classified as necrosis, i.e. not having tumor, and hence classified as not undergoing surgical resection (Fig. 15.8). The y axis was the fraction of missed tumors, i.e. tumor masses left unresected. The point of 25% classified as necrosis corresponds to using a cutoff of 68% for the probability of necrosis; only patients with a probability over 68% are not resected. We miss 9% of the tumors with that cutoff. Hence, sparing surgery in 25% leads to missing 9% of the tumors. The point of 75% classified as necrosis corresponds to using a low cutoff (21%), and missing 58% of the tumors. Hence 42% of the tumors are concentrated in the upper quartile of the distribution.

At external validation, the curve looks worse, which is related to a lower discriminative ability and to a lower average prevalence of necrosis (28% vs. 45%). The 25% and 75% cumulative fractions correspond to cutoffs of 40% and 8% for the probability of necrosis, and lead to 13% and 65% missed tumors, respectively.

As a reference, we consider the current widely used policy of resection if the residual mass size exceeds 10 mm.418 This policy uses only one of the five predictors in the model (post-chemotherapy mass size), and hence has less discriminative ability (the point is closer to the 45° line in Fig. 15.8). In the development sample, 107 of the 544 patients (20%) had residual masses <=10 mm, but 30 of them had tumor (fraction of tumors missed, 30 of 299, 10%). In the validation sample, only 9 of the 273 patients (3.3%) had residual masses <=10 mm, but 6 of them had tumor (fraction of tumors missed, 6 of 197, 3%). Hence, the reference policy did not perform well in the validation sample.

*15.2.9 Verification Bias and Discriminative Ability

In the testicular cancer validation sample, only nine patients had very small residual masses. This reflects the policy for resection in the specific centre, where patients with such very small masses were not considered candidates for resection.466 This leads to verification bias; we do not know the histology of these masses, since they were not resected, and we cannot evaluate predictions for these patients. We know that the estimation of regression coefficients is not biased by this selection, if we include the selection criterion (residual mass size) in the prediction model. Hence model predictions are valid even with verification bias.497 But performance measures such as sensitivity and specificity suffer from this verification bias.30 The c statistic may not be affected too much, because verification bias merely shifts us along the ROC curve to a different combination of sensitivity and specificity.

15.2.10 R Code

The boxplot is created simply with the boxplot command, based on a “full model,” including five predictors in the development data:

  • lp <- full$linear.predictors

  • boxplot(plogis(lp) ~ full$y) # Fig 15.7

The discrimination slope is the difference between the mean predicted probabilities by outcome:

  • mean(plogis(lp[full$y==1])) - mean(plogis(lp[full$y==0]))

Lorenz curves are created with the ROCR package:

  • library(ROCR)

  • # Make ROC object with predicted probability for outcome

  • pred.full <- prediction(plogis(lp), full$y)

  • # Lorenz curve data and plot

  • perf1 <- performance(pred.full, "fpr", "rpp")

  • plot(perf1, xlab="NOT undergoing resection", ylab="with unresected tumor")

  • abline(a=0, b=1) # Fig 15.8

15.3 Calibration

Another important property of a prediction model is calibration, i.e. the agreement between observed outcomes and predictions. For example, if we predict 70% probability of benign tissue for a testicular cancer patient, the observed frequency of benign tissue should be 70 out of 100 patients with such a predicted probability.

15.3.1 Calibration Plot

A calibration plot has predictions on the x axis and the outcome on the y axis. A line of identity helps for orientation: Perfect predictions should be on the 45° line. For linear regression, the calibration plot is a simple scatter plot. For binary outcomes, the plot contains only 0 and 1 values on the y axis, and the corresponding probabilities are not observed directly. However, smoothing techniques can be used to estimate the observed probabilities of the outcome (p(y=1)) in relation to the predicted probabilities. The observed 0/1 outcomes are replaced by values between 0 and 1 by combining outcome values of subjects with similar predicted probabilities, e.g. using the loess algorithm.174 We can also plot results for subjects grouped by similar probabilities (quantiles), and thus compare the mean predicted probability with the mean observed outcome. For example, we can plot the observed outcome by decile of predictions (Fig. 15.9). This makes the plot a graphical illustration of the Hosmer-Lemeshow goodness-of-fit test (see Sect. 15.3.8). A better discriminating model has more spread between such deciles than a poorly discriminating model. The choice of quantiles is important for the visual impression of calibration; if small groups are plotted, the variability will be large.
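A minimal sketch of such a plot for hypothetical vectors p (predicted probabilities) and y (0/1 outcomes), with a loess smoother and observed frequencies by decile:

    plot(0:1, 0:1, type="n", xlab="Predicted probability", ylab="Observed frequency")
    abline(a=0, b=1, lty=2)                          # ideal 45-degree line

    fit.loess <- loess(y ~ p)                        # smoothed observed outcome
    ord <- order(p)
    lines(p[ord], fitted(fit.loess)[ord])

    decile <- cut(p, quantile(p, probs=seq(0, 1, 0.1)), include.lowest=TRUE)
    points(tapply(p, decile, mean), tapply(y, decile, mean), pch=2)  # triangles per decile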

Fig. 15.9
figure 9

Calibration plot of actual outcome vs. predictions for a hypothetical model with c statistic 0.7, n=500. The distributions of actual 0 and 1 values are shown at the bottom and at the top of the graph; the loess smoother is close to the ideal 45° line; actual outcomes by deciles of risk are shown by triangles (each triangle, n=50)

15.3.2 Calibration in Survival

In a survival context, the calibration of a model is usually studied at fixed time points. For these time points, we can consider grouped patients, with sufficient numbers per group to allow for calculation of survival rates with the Kaplan-Meier method. This observed survival is compared with the mean predicted survival from the prognostic model. Harrell suggests using at least 50 subjects per group, depending on the hazard of the outcome.174 It would be interesting to plot a smoothed curve as for binary outcomes, but this is not easy.

15.3.3 Calibration-in-the-Large

A calibration plot can easily be made for the data set used to develop a model; this indicates the apparent calibration. In model development, the average of the predictions equals the average of the outcomes: mean(Y) = mean(Ŷ). For example, mean(observed BP) = mean(predicted BP) in linear regression, and mean(observed 30-day mortality) = mean(predicted 30-day mortality). This correspondence is guaranteed by the intercept in a (generalized) linear model, and it remains at internal validation with bootstrapping. When we apply the model to external data, this correspondence may no longer hold. The difference between mean(Ŷ) and mean(Y new) is referred to as “calibration-in-the-large.”

15.3.4 Calibration Slope

Another important calibration measure is related to the average strength of the predictor effects. For linear regression, we can write \(Y_{{\rm new}} = a + b_{{\rm overall}} \hat Y\), and for generalized linear models \(f(Y_{{\rm new}}) = a + b_{{\rm overall}} \times {\rm linear\ predictor}\), where the linear predictor is the combination of the regression coefficients from the model and the predictor values in the new data. A link function f is used for Y new, e.g. the log odds (or logit) in logistic regression. The coefficient b overall is named the calibration slope.86 Ideally, the calibration slope b overall = 1. At apparent validation, b overall = 1 because this yields the best fit on the data under study with either least squares or maximum likelihood methods. At internal validation, the calibration slope reflects the amount of shrinkage that is required for a model (b overall < 1).81 It indicates how much we need to reduce the effects of predictors on average to make the model well calibrated for new patients from the underlying population. The calibration slope can hence be used as a shrinkage factor to adjust a model for future use (Chap. 14). At external validation, the calibration slope reflects the combined effect of two issues: overfitting on the development data and true differences in the effects of predictors.

15.3.5 Estimation of Calibration-in-the-Large and Calibration Slope

For continuous outcomes, calibration-in-the-large can be assessed easily by comparing \({\rm mean}(\hat Y)\) and mean(Y new), and testing the differences \(Y_{{\rm new}} - \hat Y\), e.g. with a one-sample t-test. This test indicates the statistical significance of the mean under- or overestimation of the observed outcome: \({\rm mean}(Y_{{\rm new}} - \hat {Y})\). In a linear regression model, we can estimate an intercept a in a model with the residual \(Y_{{\rm new}} - \hat {Y}\) as the outcome: \(Y_{{\rm new}} - \hat {Y} = a\). The recalibration model is simply \(Y_{{\rm new}} = a + b_{{\rm overall}} \hat {Y}\). The deviation of the calibration slope from 1 can be tested in linear regression by a model for the residuals: \(Y_{{\rm new}} - \hat {Y} = a + b_{{\rm miscalibration}} \hat {Y}\), where b miscalibration = b overall − 1. The significance of b miscalibration is then determined as usual in regression, and indicates on average stronger or weaker effects of the predictors in the model.
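A sketch of these assessments for a continuous outcome, with hypothetical vectors y.new (observed outcomes in the new data) and y.hat (predictions from the previously developed model):

    t.test(y.new - y.hat)                     # calibration-in-the-large: mean residual vs. 0

    recal <- lm(y.new ~ y.hat)                # recalibration model: intercept a, slope b.overall
    coef(recal)

    miscal <- lm(I(y.new - y.hat) ~ y.hat)    # slope here equals b.overall - 1
    summary(miscal)$coefficients              # test of deviation of the slope from 1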

For binary outcomes, calibration-in-the-large again refers to the difference between \({\rm mean}(\hat {Y})\) and mean(Y new). A simple comparison can directly be made, with an odds ratio indicating the average under- or overestimation of the outcome: \({\rm OR} = {\rm odds}({\rm mean}(\hat Y))/{\rm odds}({\rm mean}(Y_{{\rm new}})) = [{\rm mean}(\hat {Y})/(1 - {\rm mean}(\hat {Y}))]/[{\rm mean}(Y_{{\rm new}})/(1 - {\rm mean}(Y_{{\rm new}}))]\).

For statistical testing of the difference we need to be more careful. In logistic regression, the relationship between the outcome y and the linear predictor is non-linear (i.e. logistic). We have to compare \({\rm logit}(Y_{{\rm new}} = 1) - {\rm logit}(\hat {Y})\), where \({\rm mean}({\rm logit}(Y_{{\rm new}} = 1) - {\rm logit}(\hat {Y}))\) is not equal to \({\rm mean}({\rm logit}(Y_{{\rm new}} = 1)) - {\rm mean}({\rm logit}(\hat {Y}))\).

In a model, we could write \({\rm logit}(Y_{{\rm new}} = 1) - {\rm logit}(\hat {Y}) = a;\) or \({\rm logit}(Y_{{\rm new}} = 1) = a + {\rm logit}(\hat {Y}) = a + {\rm offset}({\rm linear \ predictor}).\)

The intercept a then reflects the difference in log odds between predictions and observed outcomes, adjusted for the linear predictor. The offset ensures that predictions are taken literally, as in linear regression: values of the offset variable are subtracted from the actual outcomes Y new (as in Poisson regression). Equivalently, we can think of a regression coefficient for the offset variable that is fixed at unity. The statistical significance of the intercept a can be tested with standard regression tests, such as the Wald test or the likelihood ratio (LR) test.
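A sketch in R, with hypothetical vectors y.new (observed binary outcome in the new data) and lp (the linear predictor of the developed model evaluated in the new data):

    # Calibration-in-the-large: intercept a with the linear predictor as offset
    fit.large <- glm(y.new ~ offset(lp), family=binomial)
    summary(fit.large)$coefficients      # intercept a and its Wald test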

The calibration slope can be estimated from the recalibration model \({\rm logit}(Y_{{\rm new}} = 1) = a + b_{{\rm overall}} \times {\rm logit}(\hat {Y}) = a + b_{{\rm overall}} \times {\rm linear \ predictor}.\)

The deviation of the calibration slope from 1 (“miscalibration”) can be tested by a model that includes an offset variable: \({\rm logit}(Y_{{\rm new}} = 1) = a + b_{{\rm miscalibration}} \times {\rm linear \ predictor} + {\rm offset}({\rm linear \ predictor}).\)

The slope coefficient b miscalibration reflects the deviation from the ideal slope of 1, and can be tested with Wald or LR statistics.
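Continuing the sketch above, the calibration slope and the test for deviation from a slope of 1 can be estimated as follows (same hypothetical y.new and lp):

    # Recalibration model: intercept a and calibration slope b.overall
    fit.recal <- glm(y.new ~ lp, family=binomial)
    coef(fit.recal)

    # Miscalibration model: the coefficient for lp equals b.overall - 1
    fit.miscal <- glm(y.new ~ lp + offset(lp), family=binomial)
    summary(fit.miscal)$coefficients     # Wald test of deviation from slope 1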

Calibration-in-the-large cannot be detected with a refitted Cox regression model, since the baseline hazard is usually left free in fitting such a model. For a survival outcome, the calibration slope can be assessed as: \(\log({\rm hazard}(y_{{\rm new}} = 1)) = h_0 + b_{{\rm overall}} \times {\rm linear \ predictor}.\)

The model for deviation from a slope of 1 is: \(\log({\rm hazard}(y_{{\rm new}} = 1)) = h_0 + b_{{\rm miscalibration}} \times {\rm linear \ predictor} + {\rm offset}({\rm linear \ predictor}).\)

Testing of coefficient b miscalibration is as usual, i.e. with a Wald test or LR test.
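A corresponding sketch for survival outcomes with the survival package, assuming a hypothetical validation data set val with time and status variables and the linear predictor lp of the developed model (a vector aligned with val):

    library(survival)

    fit.slope  <- coxph(Surv(time, status) ~ lp, data=val)               # calibration slope b.overall
    fit.miscal <- coxph(Surv(time, status) ~ lp + offset(lp), data=val)  # deviation from slope 1
    summary(fit.miscal)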

With a parametric survival model, we can specify parameters that reflect differences in average survival, after adjustment for predictor effects. van Houwelingen transformed the baseline hazard from a Cox model to a Weibull model.456 The Weibull model has two parameters to describe the baseline hazard parametrically (Chap. 4). These two parameters can be refitted for external validation data, together with the linear predictor, to estimate a recalibrated model.

15.3.6 Other Calibration Measures

Various other measures are available for calibration. An intuitively appealing measure of calibration is the absolute difference between smoothed observed outcomes and predicted probabilities (Harrell’s E statistic).174 This measure is related to the calibration plot, and depends on the way the 0/1 outcomes are smoothed. The difference between smoothed observed outcomes and predicted probabilities can also be judged visually in a calibration plot such as Fig. 15.9.

15.3.7 Calibration Tests

Statistical tests can be performed with various null hypotheses for calibration, phrased in the formulation of the recalibration model \(y \sim a + b_{{\rm overall}} \hat {y}\) (Table 15.5). Tests for calibration-in-the-large and the calibration slope have one df; the overall calibration test has two df. The test for calibration-in-the-large requires that the predictions are taken literally (b overall = 1). In generalized linear models, this can be achieved with an offset variable. The calibration slope can easily be estimated in the recalibration model. The recalibration test has several advantages (Table 15.6). It can pick up common patterns of miscalibration, i.e. systematic differences between the new data and the model development data, and overfitting of the effects of predictors. Moreover, the test parameters a and b overall are well interpretable, provided that a | b overall = 1 is reported (rather than a with b overall left free). The slope b overall can directly be taken from the recalibration model (where a is left free).

Table 15.5 Calibration tests for prediction model \(y \sim a + b_{{\rm overall}} \hat {y}\)
Table 15.6 Summary of some measures for calibration of a prediction model for binary outcomes

Statistical testing for calibration has a number of drawbacks. First, the null hypothesis is that of good calibration. Hence, if we test calibration in a small study, we have low power and will not reject the null hypothesis unless miscalibration is very severe. On the other hand, even a model with very good, but not perfect, calibration will fail a calibration test if the sample size is sufficiently large.

15.3.8 Goodness-of-Fit Tests

Calibration is related to goodness-of-fit, which relates to the ability of a model to fit a given set of data. Typically, there is no single goodness-of-fit test that has good power against all kinds of lack of fit of a prediction model. Examples of lack of fit are missed non-linearities, interactions, or an inappropriate link function between the linear predictor and the outcome. Goodness-of-fit can be tested with a χ2 statistic.

For binary outcomes, the Hosmer-Lemeshow (H-L) goodness-of-fit test is often used.199 Usually, patients are grouped by decile of predicted probability. The sum of the predicted probabilities is the number of expected outcomes; this expected number is compared with the observed number in the ten groups with a χ2 test. In model development, this χ2 test has eight degrees of freedom; at external validation, the number of degrees of freedom is 9. There are many drawbacks to the H-L test.198,174 First, there are some technical issues: Should we always use deciles of predictions, or make the quantiles dependent on the sample size? Can we group by risk interval, e.g. 0–10%, 11–20%, etc. (“interval grouping”)? Second, the test has poor power to detect miscalibration in the common form of systematic differences between outcomes in the new data and the model development data, or to detect overfitting of the effects of predictors. Some have proposed that the H-L test should only be used in model development, in addition to more specific tests of model assumptions, such as tests for linearity (adding non-linear transformations) and additivity (adding interaction terms). Reported H-L tests are usually non-significant if they reflect apparent validation on the data that were also used to construct the model. Such non-significant results may contribute to the face validity of a model as perceived by some readers, but have no scientific meaning.
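A sketch of the H-L calculation for hypothetical p and y vectors, similar in spirit to the hl.ext function mentioned in Sect. 15.3.11:

    # Hosmer-Lemeshow type test by deciles of predicted probability;
    # df = groups - 2 at model development, groups - 1 at external validation
    hl.test <- function(p, y, groups=10, df=groups-2) {
      decile   <- cut(p, quantile(p, probs=seq(0, 1, length.out=groups+1)), include.lowest=TRUE)
      observed <- tapply(y, decile, sum)          # observed events per group
      expected <- tapply(p, decile, sum)          # sum of predicted probabilities per group
      n        <- tapply(y, decile, length)
      chi2     <- sum((observed - expected)^2 / (expected * (1 - expected / n)))
      c(chi2=chi2, df=df, p.value=1 - pchisq(chi2, df))
    }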

Alternative goodness-of-fit tests have been proposed with better statistical properties, such as the Goeman-Le Cessie goodness-of-fit test.250,141 It assesses the alternative hypothesis that any nonlinearities or interaction effects have been missed in a logistic regression model. Such neglected effects can be detected by looking for patterns in the residuals: Observations close to each other in covariate space, which deviate from the model in the same direction. The approach is to smooth the regression residuals and to test whether these smoothed residuals have more variance than expected under the null hypothesis, which occurs when residuals that are close together in the covariate space are correlated. The test statistic is a sum of squared smoothed residuals.

Another approach to goodness-of-fit is to study observed vs. expected outcomes in subgroups of patients. For example, we can assess the difference between observed vs. expected outcomes in males and females, or other subgroups of patients. If the effect of the subgroup is not well modelled, e.g. an interaction was missed, this might be reflected in this assessment. There are however more direct ways of assessing the influence of subgroup characteristics, as was discussed in Chap. 13 on model specification. So, this check for calibration is also more for face validity of the model and for convincing potential users than a serious check of calibration. Measures for assessment of calibration are compared in Table 15.6.

15.3.9 Calibration of Survival Predictions

For survival outcomes, formal tests similar to the H-L test are possible by comparison of observed K-M percentages with average predictions across groups of patients. Furthermore, we can study the distribution of Cox-Snell residuals, in a plot of the cumulative hazard vs. the residuals, which should form a straight line.174

15.3.10 Example: Calibration in Testicular Cancer Prediction Model

For the prediction model of residual mass histology, we plot the actual outcomes vs. predictions for the development sample and the validation sample (Fig. 15.10). We include the distribution of predicted risks, such that discrimination can also be judged. The results by decile of predicted risk are shown in Table 15.7, to clarify the calculation of the Hosmer-Lemeshow statistic. Other tests for miscalibration included the tests for calibration-in-the-large and the calibration slope, and the Goeman-Le Cessie test; these were non-significant both at model development and at external validation (Table 15.8).

Fig. 15.10
figure 10

Validity of predictions of necrosis in the development sample (n=544) and in the validation sample (n=273). The distribution of predicted probabilities is shown at the bottom of the graphs, separately for those with necrosis and those with residual tumor. The triangles indicate the observed frequencies by deciles of predicted probability

Table 15.7 Hosmer-Lemeshow test for calibration of the testicular cancer prediction model
Table 15.8 Calibration of testicular cancer prediction model

15.3.11 R Code

The Hosmer-Lemeshow test is implemented in a simple function hl.ext. The user can specify the number of groups (ten by default) and degrees of freedom (groups – 2 for model development, groups – 1 for model validation).

Calibration plots are made by an extension of Harrell’s val.prob function, called val.prob.ci. This function also provides assessment of calibration-in-the-large, the calibration slope, and the calibration test p-value. Goeman provided R code for the functions mlogit (for binary or multinomial logistic regression), smoothU (for calculation of smoothed residuals), and testfit (for the Goeman-Le Cessie goodness-of-fit test).

15.3.12 Calibration and Discrimination

The calibration plot can be extended into a “validation plot” as a central tool to visualize model performance. Calibration is shown by observed outcomes being close to prediction, while discrimination aspects can be indicated with the distribution of the predicted probabilities. The distribution can be shown by a histogram or density distribution. We can also make separate histograms for those with and without the outcome for further insights (see e.g. Fig. 15.10). It also helps to see the separation according to quantiles of predicted probabilities. For example, when deciles are used, these will be relatively far apart for a good discriminating model.

Calibration-in-the-large is a phenomenon that is fully independent of discrimination. For example, we can change the incidence of the outcome in a case-control study, but the discrimination will be unaffected. The calibration slope however has a direct relationship with discrimination. If the calibration slope is below unity, the discrimination is generally lower. Hence, overfitted models will show both poor calibration and poor discrimination when validated in new patients (Chap. 19).

Perfect calibration is possible with poor discrimination, for example when the range of predicted probabilities is small, such as between 9% and 11% for an average incidence of the outcome of 10%. At external validation, such a small range in predictions may arise from a narrow selection of patients (a homogeneous case-mix). A drop in discriminative ability compared with the development setting can hence be explained by overfitting (calibration then also poor), or by a more homogeneous case-mix (independent of calibration, see Chap. 19). On the other hand, a well discriminating model can have poor calibration, which can be corrected with various updating methods (Chap. 20).

15.4 Concluding Remarks

In this chapter we have discussed a number of performance measures for prediction models; many more can be used, as systematically discussed in work by Hilden, Bjerregaard, and Habbema in the 1970s.162,191,192,161,163 Many performance measures are related to each other; e.g. the c statistic is related to the Mann-Whitney U statistic, which is calculated as a rank-order test for the difference in predictions between those with and without the outcome. The c statistic is also linearly related to Somers' D statistic (c = D/2 + 0.5).

From a simple statistical perspective, we want a small distance between the observed outcome Y and the predicted outcome Ŷ. Explained variation (R 2) can then be used to indicate performance, and indicates the predictability of the outcome: How much do we already know about the phenomena that lead to the outcome?372 Diagnostic prediction models would hence be expected to have a higher R 2 than prognostic models with long-term outcomes. Indeed, prognostic models usually have an R 2 around 0.20. This indicates that substantial uncertainty remains at the individual level; we can only provide probabilities, and no certainty about the individual outcome.13,112

We have focused on measures that are in wide use in medical journals nowadays, including the concordance statistic (‘c,’ or area under the ROC curve) for discrimination, and various tests for calibration and goodness-of-fit. The c statistic has been criticized by some, and should not be the only criterion in assessment of model performance. Especially, c may be rather insensitive to inclusion of additional predictors in prediction models, such as novel biomarkers.79,330 But our theoretical examples and case study show that the c statistic is a key measure; it is closely related to other performance measures such as R 2 and Brier score.

In principle we might focus our modelling strategy on optimizing performance measures such as the c statistic. Indeed, estimation algorithms have been described that maximize the c statistic rather than the log likelihood.332

Compared with current practice, calibration should receive more attention when evaluating prediction models. The recalibration test and its components (calibration-in-the-large and calibration slope) should be used routinely in performance assessment in external validation of prediction models.

15.4.1 Bibliographic Notes

The framework of a recalibration model was already proposed by Cox,86 and has been supported by many other researchers for evaluation of model performance.81,174,290,291,458 Nice illustrations of diagnostic test evaluation with ROC curves are available at:

15.5 Questions

  1. 15.1

    Overall performance measures

    Overall performance measures for logistic regression models include Brier score and R 2 type of measures, such as Nagelkerke’s R 2.

    1. (a)

      What values can Brier scores and R 2 take?

    2. (b)

      What types of scoring rule are Brier and R 2?

    3. (c)

      What are disadvantages of Brier and R 2?

  2. 15.2

    Lorenz curve and incidence (Fig. 15.6)

    In a Lorenz curve, the visual impression of a model with a c statistic of 0.80 depends on the incidence of the outcome.

    1. (a)

      What happens when a Lorenz curve is made for a situation with 1% incidence?

    2. (b)

      And what for 99% incidence?

  3. 15.3

    Interpretation of validation graph (Fig. 15.10)

    Validity of predictions can well be judged graphically. How do you judge

    1. (a)

      calibration-in-the-large?

    2. (b)

      calibration slope?

    3. (c)

      discrimination?

  4. 15.4

    Relationship between calibration, discrimination, and overall performance.

    Explain the differences and the relation between calibration, discrimination, and overall performance measures.