Introduction

Risk prediction models hold enormous potential for assessing surgical risk in a standardized, objective manner [1]. They can be used to guide clinical decision making and perioperative management, enable informed consent, stratify risk for inclusion into randomized controlled trials, and audit, monitor, assess, and compare surgical outcomes in different healthcare providers [2, 3]. Regardless of the reasons for using a particular risk prediction model, it is important that it is appropriately developed and validated [2, 4, 5].

The last 10–15 years have seen an explosion in risk scores predicting surgical outcomes, such as mortality [6], complications [7], morbidities [8], and bleeding [9]. More widely, risk prediction models have proliferated in the medical literature, resulting in many competing models for the same outcome or target population. For example, there are nearly 800 models for patients with cardiovascular disease [10], over 360 models for predicting incident cardiovascular disease [11], 263 models in obstetrics (with 69 predicting the risk of preeclampsia) [12], over 100 models for patients with prostate cancer [13], and 20 models for predicting prolonged intensive care stay following cardiac surgery [14].

Despite the vast number of risk prediction models developed, they have not lived up to their potential. Systematic reviews examining the methodological and reporting quality of these models have found widespread deficiencies that limit their usefulness [12, 14, 15, 16•, 17–21]. The aim of this paper is to provide an overview of the methodological issues that should be considered when developing and validating a risk prediction model to ensure a useful, accurate model. We use the EuroSCORE model as our case study [1]. The development of the original EuroSCORE was described in two separate articles, one identifying the risk factors [22] and one constructing the model [23].

Assessing the Need for a New Risk Prediction Model

Before deciding to build a new risk prediction model, it is useful to check whether there are any existing models that predict similar outcomes in your target population and clinical setting to avoid duplication of effort. If such models do already exist, then you should first evaluate and compare their predictive performance on your data [24]. If they show promising performance, then recalibration or updating may produce the model that you need [25, 26]. A new model should only be developed from scratch if a similar model does not exist or if similar models cannot be recalibrated to meet the needs of your particular target population and clinical setting. The CHARMS checklist provides guidance on how to conduct systematic reviews of risk prediction models, including how to search, what information to extract, and how to assess study quality and risk of bias [27•].

Overview of Steps in Developing and Validating a Risk Prediction Model

If you have access to an existing dataset and do not have to prospectively collect data, then developing a risk prediction model is easy. You can load the data into your statistical software package, click a few buttons, and churn out yet another new model [28]. However, just because you can easily develop a model, it does not mean you should. The resulting model may add an extra line to your list of publications, but it is (hopefully) highly unlikely that an ill-thought-out model will ever be used on an actual patient.

As with any research study, there should be a clear rationale for why a new risk prediction model is needed. A detailed protocol describing every step needed to develop and validate the model should be written and, if possible, published (for example, in diagnprognres.biomedcentral.com) [29•]. The abundant methodological and practical guidance now available to investigators wishing to develop or validate a risk prediction model leaves little excuse for producing unusable models [4, 5, 30–38, 39••, 40, 41•]. Table 1 gives a brief overview of the main issues, which are discussed in more detail throughout the article.

Table 1 Considerations for developing a multivariable risk prediction model

Design

An appropriate study design is key to developing or validating a risk prediction model. The preferred design for both development and validation studies is a prospective longitudinal cohort study. This design gives the investigator full control to ensure all relevant predictors and outcomes are measured and collected, thereby minimizing missing values and loss to follow-up. However, risk prediction models often have to be developed and validated using existing data collected for a different purpose. Although using existing data is cost efficient and convenient, these datasets have clear problems: they are often small, include too few outcome events, have missing values, do not include important predictors, or use inaccurate methods for measuring important predictors. Data from randomized clinical trials can be used, but trials’ strict eligibility criteria can limit their generalizability, and the issue of how to handle treatment assignment needs to be addressed [42, 43]. Case–control studies are generally not appropriate for developing prediction models, as the correct baseline risk or hazard cannot be estimated from the data [44] unless a nested case–control or case-cohort design is used [44, 45].

EuroSCORE was developed using a prospective cohort study involving 132 centres from 8 European countries [22, 23]. All patients (n = 20,014) undergoing cardiac surgery between September and December 1995 were included; 984 patients (approximately 5 % of the cohort) were omitted after error checking and quality control, leaving 19,030 for analysis [22]. Using a prospective cohort design enabled efficient collection of 97 preoperative and operative risk factors that were deemed credible, objective, and reliable.

Sample Size

Sample size recommendations for studies developing new risk prediction models are generally based on the concept of events-per-variable (EPV). To reduce the risk of overfitting, whereby the model performs optimistically well on the dataset used to develop it but poorly on other data, the investigator should control the ratio of the number of outcome events to the number of variables examined. More precisely, it is the number of regression coefficients estimated that matters; for example, a categorical predictor with k categories requires k − 1 regression coefficients. Furthermore, the count should include all variables examined prior to any variable selection, including any univariate screening of individual variables, which should be avoided [46].

A minimum value of 10 EPV is widely used [47, 48] to avoid overfitting in development studies, although the regression coefficients may then still need shrinking. However, much larger EPV values are preferable [49, 50]. For validation studies, the recommended minimum sample size is 100 outcome events, with 200 outcome events preferred to ensure accurate estimation of model performance [51–53].
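To make the arithmetic concrete, the short Python sketch below computes the EPV and the minimum number of outcome events implied by the 10 EPV rule of thumb. The 698 deaths and 97 candidate risk factors are taken from the EuroSCORE case study discussed below; everything else is illustrative.

```python
# Minimal sketch: checking the events-per-variable (EPV) ratio before model
# development. Remember that a categorical predictor with k categories
# contributes k - 1 coefficients, so count coefficients, not predictors.

def events_per_variable(n_events: int, n_coefficients: int) -> float:
    """EPV = number of outcome events / number of candidate coefficients."""
    return n_events / n_coefficients

n_events = 698        # deaths in the full EuroSCORE cohort (illustrative use)
n_coefficients = 97   # candidate risk factors examined, before any selection

epv = events_per_variable(n_events, n_coefficients)
min_events_for_epv10 = 10 * n_coefficients

print(f"EPV = {epv:.1f} (rule of thumb: at least 10, preferably much higher)")
print(f"Events needed for EPV = 10 with {n_coefficients} coefficients: {min_events_for_epv10}")
```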

To develop the EuroSCORE model, the authors randomly split the dataset into two cohorts, in a seemingly 90:10 ratio for the development and validation cohorts. The development dataset comprised 13,302 patients, whilst the validation cohort comprised 1497 patients [23]. Neither the number of deaths nor the mortality rate was reported separately for the two cohorts, only overall (698 deaths); assuming a 90:10 random split, we therefore estimate there were approximately 628 and 70 deaths, respectively. Authors should clearly report the number of outcome events for each separate analysis.

Missing Data

Although almost all studies have missing information in their predictors or outcomes, missing data are often handled inadequately [54]. A common approach is a ‘complete-case’ analysis, which includes only individuals with complete information on all predictors and the outcome. However, simply excluding individuals with any missing values can lead to biased estimates and standard errors when the probability of data being missing is related to the outcome or other observed variables (missing at random; MAR) [55, 56]. A complete-case analysis therefore makes the strong assumption that the reason for the missing data is unrelated to the outcome, an assumption that is rarely met.

Imputation approaches are preferable to complete-case analysis. These approaches replace missing values with estimates drawn from the distribution of the observed data and assume the MAR mechanism. Either single or multiple imputation can be used [57, 58]. Single imputation replaces each missing value with a single estimate (e.g., the overall mean) and commonly results in underestimated standard errors [59, 60]. In multiple imputation, several plausible completed datasets are created, the analysis is run on each dataset, and the results are combined into a single estimate with standard errors that reflect the uncertainty about the missing values; this leads to more accurate standard errors [59, 61]. Five or 10 imputed datasets are commonly used, although a recently published rule of thumb suggests that the number of imputations should be greater than or equal to the percentage of missing data [58]. Practical guidance for handling missing data when developing and validating risk prediction models should be followed [62–64].
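For readers who wish to see what multiple imputation looks like in practice, the sketch below (in Python, with hypothetical column names and only the predictors imputed) creates several imputed datasets, fits the same logistic regression on each, and pools the coefficients using Rubin’s rules. Dedicated multiple imputation software offers this workflow ready-made, so this is an illustration of the mechanics rather than a recommended implementation.

```python
# Minimal multiple-imputation sketch (illustrative only): impute m datasets,
# fit the same logistic model on each, and pool results with Rubin's rules.
# Assumes the outcome itself has no missing values.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def fit_logit(data: pd.DataFrame, outcome: str):
    """Fit a logistic regression and return coefficients and their variances."""
    X = sm.add_constant(data.drop(columns=outcome))
    res = sm.Logit(data[outcome], X).fit(disp=0)
    return res.params.to_numpy(), res.bse.to_numpy() ** 2


def multiply_impute_and_pool(data: pd.DataFrame, outcome: str, m: int = 20):
    predictors = data.drop(columns=outcome)
    coefs, variances = [], []
    for i in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = pd.DataFrame(imputer.fit_transform(predictors),
                                 columns=predictors.columns, index=predictors.index)
        completed[outcome] = data[outcome].to_numpy()
        beta, var = fit_logit(completed, outcome)
        coefs.append(beta)
        variances.append(var)
    coefs, variances = np.array(coefs), np.array(variances)
    pooled = coefs.mean(axis=0)                 # Rubin's rules: pooled estimate
    within = variances.mean(axis=0)             # average within-imputation variance
    between = coefs.var(axis=0, ddof=1)         # between-imputation variance
    pooled_se = np.sqrt(within + (1 + 1 / m) * between)
    return pooled, pooled_se
```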

In the EuroSCORE study, the handling of missing data is somewhat unclear. As the observant reader may have already noticed, the original EuroSCORE database comprised 19,030 patients, yet the development (n = 13,302) and validation (n = 1497) cohorts together were substantially smaller, leaving 4231 patients (more than 20 % of the original dataset) unaccounted for. With regard to the completeness of individual risk factors, neither publication mentions the presence of missing data [22, 23]. Were the 4231 patients omitted because of missing data? Is there anything special about these omitted patients? Regardless, omitting such a large proportion of the data is worrying, and little is said about why it was done or about the implications of doing so. Studies should clearly report the flow of participants, describing the amount of missing data for each predictor as well as overall.

Modelling Continuous Predictors

Predictors are often recorded as continuous measurements, but are commonly converted into two or more categories for analysis [65]. This categorization of continuous predictors has several disadvantages. Categorization discards information, a problem at its most severe when the predictors are dichotomized (divided into two categories). This information loss can result in a loss of statistical power and can force an incorrect relationship between the predictor and outcome. If the cut points to create the categories are not predefined, but are chosen to find the smallest P value, then the predictive performance of the model will be overly optimistic [66, 67•]. Even with a prespecified cut point, dichotomization has been shown to be statistically inefficient [68–72]. Although using more categories reduces the information loss, this is rarely done in practice. Regardless of the number of categories, the statistical power is reduced and precision suffers in comparison with a continuous modelling approach [73]. Categorizing continuous predictors ultimately leads to poor models, as it forces an unrealistic, biologically implausible, incorrect (step) relationship onto the predictor and discards information.

The most popular approach for maintaining the continuous nature of predictors is to model a simple linear relationship between the predictor and outcome. This is often, but not always, sufficient; assuming linearity may lead to a model that excludes a relevant predictor or that imposes a relationship between the predictor and outcome that is substantially different from the “true” relationship. A better fit can be achieved using methods such as fractional polynomials (FP) or restricted cubic splines [40, 73–75]. Both of these methods allow for a nonlinear, but smooth, predictor–outcome relationship, and there is little to choose between them [67•]. Both methods can easily be implemented using standard software. FPs allow simultaneous model selection and FP specification. The results of both methods can be presented graphically, although FP results are particularly easy to interpret. It is possible to categorize predictors to implement the model if this is deemed necessary; importantly, categorizing predictors for implementation does not require the predictors to be categorized prior to model development [76].
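As an illustration of keeping a predictor continuous, the Python sketch below fits a logistic regression in which two simulated, hypothetical continuous predictors enter via cubic spline bases; a restricted cubic spline or fractional polynomial function would be used in exactly the same way, and nothing here corresponds to the actual EuroSCORE analysis.

```python
# Minimal sketch of modelling continuous predictors flexibly (smooth,
# nonlinear) instead of categorizing them. Data and variable names are
# simulated/hypothetical; bs() is patsy's cubic B-spline basis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({"age": rng.uniform(40, 85, n),
                   "creatinine": rng.lognormal(4.6, 0.3, n)})
true_logit = -4 + 0.002 * (df["age"] - 60) ** 2 + 0.004 * df["creatinine"]
df["death"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))  # simulated outcome

# bs(x, df=4) expands each predictor into a smooth spline basis, letting the
# data determine the shape of the predictor-outcome relationship.
model = smf.logit("death ~ bs(age, df=4) + bs(creatinine, df=4)", data=df).fit(disp=0)
print(model.summary())
```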

In the EuroSCORE study, continuous predictors were reportedly categorized using fractional polynomials [23]. It is not entirely clear what this entailed: fractional polynomials are used to describe nonlinear associations between a predictor and the outcome, not to categorize predictors [73]. Nevertheless, categorizing leads to models with lower predictive accuracy [67•], and if a simple, easy-to-use model is required, more methodologically robust approaches are available [76].

Model Development

More variables are often collected than can reasonably be included in a prediction model, and therefore a smaller number of variables must be selected. Variables can be reduced before modelling by, for example, critically considering the literature, soliciting input from experts, examining correlated predictors and only including one of them, removing variables with high amounts of missing data (as these will likely be missing at the point of implementing the model in practice), and removing variables that are expensive to measure [77]. Variables are often chosen for inclusion in multivariable modelling using univariate (unadjusted) associations with the outcome. However, this common approach should be avoided as important predictors can be omitted due to confounding by other predictors [46].

Data-driven approaches such as stepwise methods (e.g., forward or backward selection) are common and are implemented in most statistical software. Backward selection is generally preferred as it starts from the full model and allows the effects of all candidate predictors to be judged simultaneously [49]. However, all of these stepwise methods have limitations in small datasets [78, 79]. When datasets are small relative to the number of predictors examined, overfitting becomes a nonignorable concern and predictions from the model can, on average, be too extreme (too low or too high). Shrinkage techniques (e.g., uniform shrinkage and the LASSO) can be used to reduce overfitting: regression coefficients are shrunk towards zero, and with the LASSO small coefficients can be shrunk to exactly zero, omitting those predictors from the model.
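The sketch below illustrates LASSO-penalized logistic regression as one such shrinkage approach, using simulated data with hypothetical predictors; the penalty strength is chosen by cross-validation, and the predictors are standardized so that the penalty treats them comparably.

```python
# Minimal sketch of LASSO shrinkage for a binary outcome: small coefficients
# are shrunk to exactly zero, omitting those predictors from the model.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 1000, 30
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:5] = [1.0, -0.8, 0.6, 0.5, -0.4]   # only 5 truly informative predictors
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta - 1.5))))

lasso = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="saga", Cs=20, cv=5,
                         scoring="neg_log_loss", max_iter=5000, random_state=0),
)
lasso.fit(X, y)
coefs = lasso.named_steps["logisticregressioncv"].coef_.ravel()
print(f"{np.sum(coefs != 0)} of {p} coefficients retained (non-zero)")
```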

The development of EuroSCORE is slightly opaque, but it appears to have included screening candidate risk factors on their univariate association with the outcome (at the P < 0.2 level), followed by fitting the remaining risk factors, with inclusion in the final model based on whether they improved predictive accuracy [23]. The final step was a search for first-order interactions significant at P < 0.05, but it is not clear whether all possible interactions were examined or only a subset judged clinically plausible. Given the large number of centres and countries, no attempt appears to have been made to investigate whether accounting for clustering by centre or country would have improved the model [80]. Finally, the investigators examined 97 risk factors in total; it seems unlikely that so many risk factors could plausibly be related to the outcome, particularly as most risk scores contain only a handful of predictors. The consequence of examining such a large number of predictors is the risk of overfitting; 97 risk factors would imply that a minimum of 970 outcome events are required in the development data.

Internal Validation

Deficiencies in the statistical analysis used to develop a prediction model, such as inappropriate handling of missing data, small datasets, many candidate predictors, a poor choice of predictor selection strategy (including univariate screening and stepwise regression), and categorization of continuous predictors, can lead to optimistic estimates of model performance [81]. A model’s performance must therefore be evaluated adequately and without bias. This can be done using so-called internal validation, which refers to evaluating the model’s performance (see the “Assessing Model Performance” section) in patients from the same population that the development sample originated from.

A common approach is to randomly split the dataset into two smaller datasets. The model is derived using one of these datasets (often called the training or development dataset), and its performance is then evaluated using the other dataset (often called the test or validation dataset) [30]. This split-sample approach is common, but inefficient. For small to moderately sized datasets, it does not use all of the available data to develop the model (making overfitting more likely) and uses an inadequately small dataset for performance evaluation [81]. For large datasets, randomly splitting the data merely creates two near-identical datasets, which is hardly a strong test of the model.

The preferred approach for internal validation is to use bootstrapping to quantify and adjust any optimism in the predictive performance of the developed model [82]. All model development studies should include some form of internal validation, preferably using bootstrapping, particularly if no additional external validation is performed [39••].
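The Python sketch below outlines the usual bootstrap recipe for internal validation, here applied to the c-index of a plain logistic regression: the modelling procedure is repeated in each bootstrap sample, the difference between the bootstrap-sample and original-sample performance estimates the optimism, and the average optimism is subtracted from the apparent performance. In a real analysis every modelling step, including any variable selection, would be repeated inside the loop; this is an illustration, not the method used for EuroSCORE.

```python
# Minimal sketch of bootstrap optimism correction of the c-index (AUC).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])

    optimism = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)               # bootstrap sample with replacement
        Xb, yb = X[idx], y[idx]
        if len(np.unique(yb)) < 2:                # skip degenerate resamples
            continue
        boot_model = LogisticRegression(max_iter=1000).fit(Xb, yb)
        auc_boot = roc_auc_score(yb, boot_model.predict_proba(Xb)[:, 1])  # apparent in bootstrap
        auc_orig = roc_auc_score(y, boot_model.predict_proba(X)[:, 1])    # tested on original data
        optimism.append(auc_boot - auc_orig)

    return apparent, apparent - np.mean(optimism)
```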

As noted earlier, the EuroSCORE developers randomly split their data into a development cohort and a separate validation dataset in a 90:10 ratio. This approach is weak: it does not constitute a strong test of the model and is unlikely to be able to quantify any overfitting.

External Validation

After a prediction model has been corrected for optimism using internal validation procedures, it is important to establish whether it is generalizable to similar but different individuals beyond the data used to derive it. This process is often referred to as external validation [30]. The more external validation studies using data from different settings, and thus different case mixes, the more generalizable the model and the more likely it will be useful in untested settings. External validation can be carried out using data from the same centres as the development data but collected at a different time (temporal validation), or using data collected from different centres (geographic validation). Evaluation by independent investigators provides a particularly strong test of the model. Validation is not refitting the model on new data, nor is it repeating all of the steps in the development study. Validation applies the published model (i.e., all of the regression coefficients and the intercept or baseline survival at a given time point) to new data to obtain predictions and quantify model performance (calibration and discrimination). The recommended sample size for validation studies is a minimum of 100 outcome events, preferably 200 [51–53].
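To emphasize that external validation means applying the published model as-is, the sketch below takes a made-up logistic model (the intercept, coefficients, and variable names are purely illustrative and are not the EuroSCORE values) and computes predicted risks for new patients; these predictions would then be compared with the observed outcomes to assess calibration and discrimination.

```python
# Minimal sketch of applying a published logistic model (not refitted) to new
# data. All coefficients and variable names below are hypothetical.
import numpy as np
import pandas as pd

published_intercept = -4.79
published_coefs = {"age_units_over_60": 0.066, "female": 0.33, "creatinine_gt_200": 0.65}


def predicted_risk(new_data: pd.DataFrame) -> np.ndarray:
    """Linear predictor from the published model, converted to a probability."""
    lp = published_intercept + sum(coef * new_data[name]
                                   for name, coef in published_coefs.items())
    return np.asarray(1 / (1 + np.exp(-lp)))


# Two hypothetical patients from a validation dataset
patients = pd.DataFrame({"age_units_over_60": [0, 3],
                         "female": [0, 1],
                         "creatinine_gt_200": [0, 1]})
print(predicted_risk(patients))   # predicted risks, to be compared with observed outcomes
```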

Assessing Model Performance

The aim of both internal and external model validation is to quantify a model’s predictive performance [17] to indicate whether it is fit for purpose and better than any existing models [24, 83•]. Discrimination and calibration are the two key characteristics of model performance that must be assessed [84••]. Discrimination is the model’s ability to distinguish between individuals with and without the outcome of interest. It is commonly estimated with the c-index. The c-index is identical to the area under the receiver-operating characteristic curve for models predicting binary endpoints (e.g., logistic regression) [85]. It can be generalized for survival models accounting for censoring (e.g., Cox regression) [86].

Calibration refers to the agreement between the predictions from the model and the observed outcomes; that is, if the model predicts a certain risk of developing the outcome, an equivalent proportion of patients with the outcome should be observed in the validation sample. Calibration is best assessed using calibration plots showing the relationship between the observed outcomes and the predicted probabilities, using a smoothed lowess line [53, 87]. Perfect calibration corresponds to a slope of 1 and an intercept of 0 [88]; an intercept greater than 0 together with a slope less than 1 indicates overfitting of the model [89]. The Hosmer–Lemeshow goodness-of-fit test for binary outcome models is commonly used to evaluate calibration [16•]; however, the test has limited ability to detect miscalibration, being often nonsignificant (suggesting adequate calibration) in small samples and nearly always significant (suggesting miscalibration) in large datasets. Furthermore, the test fails to indicate the magnitude or direction of any miscalibration and should therefore be avoided in favour of calibration plots [39••]. Discrimination and calibration, and other statistical measures of model performance (such as R-squared and the Brier score) [83•], characterize the statistical properties of a prediction model but do not capture the clinical consequences of using it. Approaches such as decision curve analysis and relative utility should be considered to gain insight into the clinical consequences of using the model at specific probability (treatment) thresholds [90, 91].
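A minimal sketch of these performance measures is given below, assuming the predicted probabilities from the model and the observed binary outcomes in the validation data are already available; the calibration intercept and slope are obtained here by regressing the outcome on the linear predictor, and the calibration curve is a lowess smooth, as recommended above.

```python
# Minimal sketch: c-index (discrimination) plus calibration slope/intercept
# and a lowess-smoothed calibration plot for a binary outcome.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score


def assess_performance(y: np.ndarray, p_hat: np.ndarray):
    # Discrimination: the c-index equals the area under the ROC curve
    c_index = roc_auc_score(y, p_hat)

    # Calibration slope/intercept: logistic regression of the outcome on the
    # linear predictor (log-odds of the predicted risks); slope < 1 suggests overfitting
    lp = np.log(p_hat / (1 - p_hat))
    fit = sm.Logit(y, sm.add_constant(lp)).fit(disp=0)
    cal_intercept, cal_slope = fit.params

    # Calibration plot: lowess-smoothed observed outcome against predicted risk
    smooth = sm.nonparametric.lowess(y, p_hat, frac=0.5)
    plt.plot(smooth[:, 0], smooth[:, 1], label="smoothed calibration")
    plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
    plt.xlabel("Predicted risk")
    plt.ylabel("Observed proportion")
    plt.legend()

    return c_index, cal_intercept, cal_slope
```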

The EuroSCORE model was evaluated by assessing discrimination and calibration. The salient point to highlight here is the use of the Hosmer–Lemeshow test [23]. Whilst the test produced P values larger than 0.05, which the authors interpreted as indicating good calibration, this provides no meaningful indication of how well the model is calibrated. No calibration plots were presented, so it is unclear whether the model under- or over-predicts, or whether there are particular subgroups of patients for whom the model is less accurate.

Reporting

Numerous systematic reviews have shown that studies describing the development or validation of a risk prediction model are often poorly reported, with key details frequently omitted from published articles. Critical appraisal and synthesis are impossible when key details about the methodology used and the results are not fully reported, making it difficult for readers to judge whether a risk prediction model has any value. When a paper presents the development of a risk prediction model, it is vital that the full model, including all regression coefficients and the intercept/baseline hazard, is presented in the paper or an appendix, or that a link to computer code is provided, so that other investigators can evaluate the model. Whilst this may appear obvious, many published articles on risk prediction models deliberately or unknowingly fail to report the actual model that was developed (e.g., FRAX [92]). A nomogram (a graphical presentation of a risk prediction model [93]) is not a replacement for presenting the actual risk prediction model [94].

In an effort to harmonize and improve the reporting of studies developing or validating risk prediction models, the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) Initiative produced the TRIPOD Statement, which is similar to the CONSORT statement for randomized clinical trials. Published simultaneously in 11 journals, the TRIPOD Statement is a checklist of 22 key items that authors should address in their articles describing the development or validation of a risk prediction model [39••, 84••].

The specific issues we have highlighted throughout this article in the publications describing the development of EuroSCORE illustrate the problems caused by incomplete reporting. Only with full and transparent reporting can readers critically appraise the methodology and interpret the results.

Conclusion

Risk prediction models have great potential to aid in operative risk assessment. However, for these models to have any chance of being useful for clinical decision making, they must be developed using appropriate statistical methods and validated by others in different settings to determine their predictive accuracy. As studies developing prediction models are unfortunately rarely prospective, investigators face challenges such as how to handle missing data, what to do with continuous predictors, how to carry out an internal validation, and how to conduct a meaningful external validation study. They must also ensure complete and comprehensive reporting of every step of the study.

The surgical and anaesthesiology literature contains hundreds, if not thousands, of models developed for operative risk assessment. Only a very small minority have made any kind of impact on clinical practice. Point-and-click statistical software has arguably contributed to the plethora of methodologically weak, unusable prediction models [28]. It is therefore important to engage with a suitably experienced statistician before developing a new prediction model, to check whether a suitable model already exists, and to plan and, if possible, publish a protocol outlining the necessary steps for model development [29•].

Prediction models are usually static, reflecting the case mix in the data used to develop them. However, as mortality following surgery decreases and the case mix evolves over time, prediction models can become outdated and less accurate, a process called calibration drift [95, 96]. Developed in 1999 using data from a 3-month period in 1995, EuroSCORE is a classic example of calibration drift [97]. The updated EuroSCORE II was therefore developed in 2012 [1], although it too has methodological concerns [98, 99]. Unless periodic updating is done, it is likely that this model will also quickly become outdated.

In summary, risk prediction modelling is a growing field that is gaining huge interest in the era of personalized medicine. Although there are no shortcuts and many challenges when developing and validating accurate, useful prediction models, these challenges are surmountable if the abundant methodological and practical guidance available is used correctly and efficiently.