Introduction

Since the introduction of the first risk prediction tool for cardiovascular disease in 1976 [1], there has been a steady rise in the number of prediction models in various fields of clinical epidemiology including cardiology, oncology, and pediatrics. In the field of perinatal epidemiology, the motivation for prediction modeling has been to identify women at highest risk of a negative health outcome to guide prevention strategies for the mother and infant. More specifically, prediction modeling has enabled physicians to provide individualized care to women and their infants through evidence-based decision making [2].

Prediction models have been broadly used in various fields of perinatal epidemiology to predict treatment success for women undergoing fertility treatments [3], predict complications of pregnancy (e.g., preeclampsia [4] and fetal growth restriction [5]), predict outcomes at delivery (e.g., vaginal birth after a cesarean section [6]) and in the post-partum period (e.g., post-partum hemorrhage [7] and neonatal mortality in preterm infants [8]), and to rule out women at risk of an adverse outcome [9]. More recent literature has examined the association between pregnancy-related exposures and long-term outcomes in mothers and children [10, 11]. Despite the increasing number of risk prediction models being developed, few models are of sufficiently high quality or are easily implemented in routine clinical practice [2, 12, 13•]. This discrepancy can be attributed to a number of factors, including (1) inappropriate methods for model development and validation; (2) the choice of data sources and populations for model development and validation; (3) absence or imprecision in the measurement of important predictors; and (4) lack of external validation. The implications of these factors for the implementation of prediction models in routine practice will be discussed in further detail in this review.

From Development to Use in Clinical Practice

Prior to implementation of prediction models in clinical practice, researchers need to (1) develop and internally validate the model; (2) perform external validation; and (3) assess the clinical impact of the model (Fig. 1). All three components are needed to provide clinicians with an objective measure for risk stratification above clinical judgement [14].

Fig. 1 Steps to building a risk prediction model

Development and Internal Validation

The first step in model development is the identification of potentially relevant predictors based on substantive knowledge and the existing literature. Considerations for the selection of candidate predictors are discussed below. Once a list of candidate predictors has been created, data reduction is performed to remove predictors with narrow distributions (limited ability to explain variation in the outcome) or a large degree of missingness, to increase model validity and parsimony [15]. Collinearity between predictors should be assessed and minimized by choosing among correlated predictors based on objective criteria, which may include clinical relevance, availability, reliability, or cost of measurement. A full model is then estimated using the variables not previously eliminated (i.e., the strength and direction of the association between each predictor and the outcome is estimated). Ideally, continuous variables are modeled using restricted cubic splines or other smoothing functions such as fractional polynomials, and categorical variables are modeled using indicator variables. When fitting the model, shrinkage methods should be considered when dealing with small sample sizes to reduce the potential for overfitting [16, 17]. The final step of model development involves further data reduction, for which various methods have been proposed [18]. A well-established approach to data reduction is the stepdown approach of Harrell et al. [15, 17]. The benefit of this approach is that it is performed independently of the outcome, which reduces systematic bias and avoids p value-based variable selection, which tends to result in overfitting and poor model performance [15, 17].
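
To make these modeling choices concrete, below is a minimal sketch in Python using scikit-learn; the data frame, outcome, and predictor names are all hypothetical, B-splines stand in for restricted cubic splines, and an L2 (ridge) penalty serves as one simple example of shrinkage:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, SplineTransformer

# Toy cohort standing in for real data (all values hypothetical).
df = pd.DataFrame({
    "maternal_age": [24, 31, 37, 29, 42, 35, 27, 33],
    "bmi": [22.0, 27.5, 31.0, 24.3, 29.8, 26.1, 23.4, 30.2],
    "parity_group": ["nulli", "multi", "multi", "nulli",
                     "multi", "nulli", "nulli", "multi"],
    "smoking": ["no", "no", "yes", "no", "yes", "no", "yes", "no"],
    "y": [0, 0, 1, 0, 1, 0, 0, 1],   # binary outcome
})
continuous = ["maternal_age", "bmi"]
categorical = ["parity_group", "smoking"]

preprocess = ColumnTransformer([
    # Smooth functions of continuous predictors (B-splines here stand in
    # for restricted cubic splines or fractional polynomials).
    ("splines", SplineTransformer(degree=3, n_knots=4), continuous),
    # Indicator (dummy) variables for categorical predictors.
    ("dummies", OneHotEncoder(drop="first"), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    # L2-penalized logistic regression: C is the inverse penalty strength,
    # so smaller C applies more shrinkage to the coefficients.
    ("logit", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
])

model.fit(df[continuous + categorical], df["y"])
predicted_risk = model.predict_proba(df[continuous + categorical])[:, 1]
```

Stronger penalties (smaller C) shrink coefficients more aggressively, which is most valuable when the effective sample size is small relative to the number of estimated parameters.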

Assessing Model Performance

Once the final model is established, its predictive performance is examined using measures of accuracy and validity. Performance can be grouped into three main categories: (1) discrimination, (2) calibration, and (3) risk stratification. Although risk stratification is not commonly used to examine model performance, its addition provides a more comprehensive assessment since it evaluates a model’s capacity to appropriately stratify patients. The utility of all three metrics for assessing the performance of prediction models is illustrated by the stillbirth calculator used to identify the risk of stillbirth in women [19] and the fullPIERS model used to identify the risk of adverse maternal outcomes in women with preeclampsia [20]. Although both models were found to perform well based on standard metrics of discrimination and calibration, risk stratification allowed investigators to identify optimal thresholds (based on the rates of false positive and true positive predictions) to assist clinicians in their choice of treatment options for these women.

Discrimination refers to how well the model discriminates between individuals with and without the outcome [18]. A commonly used method to assess discrimination is the estimation of the area under the receiver operating characteristic curve (AUC), or the c statistic [21], in which a value of 1 refers to perfect discrimination and 0.5 is equivalent to random chance. If the prediction model involves time-to-event data, standard metrics pose problems due to unobserved event times as a result of right censoring [22]. Moreover, the presence of censoring during follow-up warrants additional consideration, since the ordering of events becomes difficult to decipher; Harrell’s concordance (c) index [18, 23], Royston and Sauerbrei’s D statistic [24], and the weighted Brier score [18, 23, 25] have been proposed to address this limitation. Harrell’s c index in the context of time-to-event data is a rank order statistic measuring the ability of a model to discriminate between individuals with different event times [18, 23]. It is a measure of the probability of concordance between observed and predicted survival given that pairs are usable (≥ 1 individual in the pair experiences the event of interest) [26]. Therefore, a model with good discriminative properties will assign a higher predicted probability of the event to an individual who experiences the event than to an individual who does not, at the same time point [15]. Royston and Sauerbrei’s D statistic is an absolute measure of separation of survival curves that quantifies the discrimination between strata of predicted risk [24]. The Brier score is a quadratic scoring rule that estimates the squared distance between the observed and predicted outcomes [18, 23]. As a measure of explained variation, it can be used to assess calibration as well as goodness-of-fit of the model. A Brier score can take values from 0 to 1, with a value of 0 suggesting perfect prediction. Brier scores are generated from the models’ predictions calculated at fixed time points (e.g., 6- or 12-month intervals) to generate time-dependent curves [23]. If censoring is substantial and informative, a weighted Brier score using inverse probability of censoring weights should be used [25, 27].
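
As an illustration, these discrimination metrics can be computed along the following lines in Python; the data are toy values, and the lifelines package is assumed for the time-to-event c index:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss
from lifelines.utils import concordance_index

# Toy observed outcomes and model-predicted risks (hypothetical values).
y = np.array([0, 1, 0, 1, 1, 0])
risk = np.array([0.10, 0.80, 0.30, 0.60, 0.90, 0.20])

# Binary outcome: AUC / c statistic (1 = perfect, 0.5 = chance).
print("AUC:", roc_auc_score(y, risk))

# Binary outcome: Brier score (0 = perfect prediction).
print("Brier:", brier_score_loss(y, risk))

# Time-to-event outcome: Harrell's c index over usable pairs.
times = np.array([5.0, 2.0, 8.0, 3.0, 1.0, 9.0])
events = np.array([1, 1, 0, 1, 1, 0])   # 0 = right-censored
# lifelines expects higher scores to indicate longer survival,
# so the predicted risk is negated.
print("Harrell's c:", concordance_index(times, -risk, events))
```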

Calibration refers to the agreement between the predicted and observed outcomes [18]. For prognostic estimates, calibration is important since it provides a measure of model reliability [18]. Calibration plots are constructed with the predictions from the model on the x-axis and the observed outcomes on the y-axis, with perfect predictions falling on the 45° line [18]. Plots can also be generated by grouping individuals based on their predicted probability of the outcome, with larger separation between groups indicating better discrimination. In the context of time-to-event data, calibration curves are created from predicted probabilities obtained from the final models and compared to observed probabilities obtained from Kaplan-Meier estimates at fixed time intervals [15]. Calibration is then measured as the difference between the observed and expected survival estimates at the specified time intervals. This difference can also be used to correct the performance measures for the degree of optimism, or overfitting, of the model. A second measure of calibration is goodness-of-fit, which is commonly assessed using the Hosmer-Lemeshow goodness-of-fit test for binary outcomes, in which the numbers of expected and observed outcomes are compared within groups of individuals using a χ2 statistic. Goodness-of-fit for survival models is typically assessed using calibration curves and the Brier score or by comparing the Cox-Snell residuals and the cumulative hazard function within risk categories [18].
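
A minimal sketch of both calibration approaches for a binary outcome, using simulated data purely for illustration, might look as follows:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.calibration import calibration_curve

# Simulated outcomes and predicted risks (a well-calibrated toy model).
rng = np.random.default_rng(0)
risk = rng.uniform(0.05, 0.95, 200)
y = rng.binomial(1, risk)

# Calibration plot coordinates: grouped observed vs. predicted risk.
# Perfect calibration places each (pred, obs) pair on the 45-degree line.
obs, pred = calibration_curve(y, risk, n_bins=10)

def hosmer_lemeshow(y, risk, g=10):
    """Chi-squared test comparing observed and expected events in g groups."""
    order = np.argsort(risk)
    stat = 0.0
    for idx in np.array_split(order, g):
        observed = y[idx].sum()
        expected = risk[idx].sum()
        p = expected / len(idx)
        stat += (observed - expected) ** 2 / (len(idx) * p * (1 - p))
    return stat, chi2.sf(stat, df=g - 2)   # g - 2 df on development data

stat, pval = hosmer_lemeshow(y, risk)
print(f"HL statistic = {stat:.2f}, p = {pval:.3f}")
```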

Risk stratification is an important measure of performance since it assesses the capacity of a prediction model to stratify individuals into clinically relevant risk groups [28]. Stratification entails dichotomizing or categorizing predicted risks based on meaningful cut-offs and assessing the capacity of the model to classify patients into the defined risk categories [29]. Risk stratification can also be used to compare the incremental value of new predictors over existing models using reclassification tables [28]. However, reclassification tables alone do not quantify whether reclassification constitutes an improvement. The net reclassification improvement (NRI) index was developed to quantify improvements resulting from appropriate risk reclassification by crediting upward reclassifications in individuals with the disease and downward reclassifications in individuals without the disease (Table 1) [30]. An extension of the NRI, the integrated discrimination improvement (IDI) index, assesses the NRI over all possible cut-offs (Table 1) [30]. Risk stratification is important for the implementation of prediction models in clinical practice since it facilitates the identification of high-risk patients and of clinically relevant thresholds for targeting prevention strategies. An example of the utility of risk stratification to guide decision making in clinical practice is the use of the miniPIERS model to stratify patients into low and high risk of perinatal death, compared to using gestational age alone [31]. Based on the performance of the model across various thresholds of predicted risk, the investigators were able to determine the incremental value of the miniPIERS model above the current standard.
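
The NRI and IDI definitions in Table 1 translate directly into code. The following Python sketch, with toy data and hypothetical risk-category cut-offs, illustrates both indices for a binary outcome:

```python
import numpy as np

def categorical_nri(y, risk_old, risk_new, cutoffs=(0.1, 0.3)):
    """NRI = P(up|event) - P(down|event) + P(down|non-event) - P(up|non-event)."""
    cat_old = np.digitize(risk_old, cutoffs)   # risk category under old model
    cat_new = np.digitize(risk_new, cutoffs)   # risk category under new model
    up, down = cat_new > cat_old, cat_new < cat_old
    event, nonevent = y == 1, y == 0
    return (up[event].mean() - down[event].mean()
            + down[nonevent].mean() - up[nonevent].mean())

def idi(y, risk_old, risk_new):
    """IDI: change in mean predicted-risk separation between events and non-events."""
    sep_new = risk_new[y == 1].mean() - risk_new[y == 0].mean()
    sep_old = risk_old[y == 1].mean() - risk_old[y == 0].mean()
    return sep_new - sep_old

# Toy example: the new model moves one event up and one non-event down.
y        = np.array([1, 1, 0, 0, 1, 0])
risk_old = np.array([0.25, 0.15, 0.20, 0.05, 0.35, 0.15])
risk_new = np.array([0.35, 0.15, 0.08, 0.05, 0.40, 0.15])
print(categorical_nri(y, risk_old, risk_new), idi(y, risk_old, risk_new))
```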

Table 1 Measures of Incremental Value and Clinical Utility

Internal Validation

Once performance measures have been established, internal validation is needed to determine the degree of overfitting of the model. Commonly used techniques such as bootstrap resampling and cross-validation allow investigators to report optimism-corrected performance measures [32]. Bootstrap resampling is particularly preferable when dealing with small sample sizes, as it provides more precise estimates of the variability associated with modeling [15].
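
As an illustration, Harrell’s bootstrap procedure for optimism-corrected discrimination can be sketched as follows; the data, model, and sample size are hypothetical:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(model, X, y, n_boot=200, seed=0):
    """Harrell's bootstrap: corrected = apparent - mean(boot AUC - test AUC)."""
    rng = np.random.default_rng(seed)
    apparent = roc_auc_score(y, model.fit(X, y).predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))       # resample with replacement
        m = clone(model).fit(X[idx], y[idx])        # refit on bootstrap sample
        auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
        auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
        optimism.append(auc_boot - auc_orig)        # degree of overfitting
    return apparent - np.mean(optimism)

# Toy data: 200 subjects, 5 predictors, outcome driven by the first two.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1]))))
print(optimism_corrected_auc(LogisticRegression(max_iter=1000), X, y))
```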

External Validation

Once the final model has been found to perform adequately, external validation should be completed to assess the generalizability of the model. External validation is performed in a study population with a different data collection strategy from that used for model development. Several methods for external validation have been proposed, including domain, geographic, and temporal validation [18, 33]. All three forms of external validation attempt to capture potential differences in model performance arising from temporal and geographical trends or heterogeneity in patient populations. A recent study externally validating the fullPIERS model for prediction of adverse outcomes in women with preeclampsia provides an example of the various methods used for external validation [34]. Using three cohorts including women from different geographic locations, with varying periods of follow-up, and with a broader range of disease (hypertensive disorders of pregnancy versus preeclampsia), the investigators were able to assess the transportability of the model across time and clinical settings.

Assessing Clinical Impact

The final step in prediction modeling involves assessing the clinical impact of the model. The presentation of absolute risks of an outcome without clearly defined decision thresholds is unlikely to modify a clinician’s decisions for patient management. Decision curve analyses were developed as a means of quantifying the harms and benefits of treatment over a range of decision thresholds [35, 36]. A decision curve is based on a measure of net benefit (NB), defined as the proportion of true positives penalized for the proportion of false positives (Table 1) [37]. The penalty for false positives is weighted by the odds of the decision threshold, reflecting the relative harm of overdiagnosis (false positives) versus the benefit of appropriate diagnosis (true positives). The clinical utility of the final model can be assessed by plotting its NB across the range of threshold probabilities against “treat all” and “treat none” scenarios [36, 38]. For the purpose of establishing clinical utility, discrimination should be prioritized relative to calibration since it facilitates decision making. However, discrimination in isolation cannot determine the impact of the model for use in clinical settings, since miscalibrated models can result in increased harm and reductions in the NB of prediction models [18].
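
The NB calculation underlying a decision curve is straightforward to sketch. The following Python example, using simulated data, computes NB for a model across a range of thresholds and compares it with the “treat all” and “treat none” reference strategies:

```python
import numpy as np

def net_benefit(y, risk, pt):
    """NB = TP/n - (FP/n) * pt / (1 - pt) at decision threshold pt."""
    treat = risk >= pt
    tp = np.mean(treat & (y == 1))   # true positives per subject
    fp = np.mean(treat & (y == 0))   # false positives per subject
    return tp - fp * pt / (1 - pt)

# Simulated outcomes and predicted risks (hypothetical values).
rng = np.random.default_rng(2)
risk = rng.uniform(0.01, 0.99, 500)
y = rng.binomial(1, risk)

thresholds = np.linspace(0.05, 0.50, 10)
prevalence = y.mean()
for pt in thresholds:
    nb_model = net_benefit(y, risk, pt)
    nb_all = prevalence - (1 - prevalence) * pt / (1 - pt)  # "treat all"
    # "Treat none" has NB = 0 by definition; the model adds value at
    # thresholds where nb_model exceeds both reference strategies.
    print(f"pt={pt:.2f}  model={nb_model:.3f}  treat-all={nb_all:.3f}")
```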

The final step prior to implementation of prediction models is to perform impact evaluation studies for clinically relevant outcomes [39]. Impact evaluations can be conducted as randomized trials; however, due to the time and cost constraints associated with such trials, observational or quasi-experimental designs (e.g., pre-post designs, regression discontinuity, or difference-in-differences designs comparing outcomes in populations in which risk prediction models are used versus standard of care) can provide a more efficient means of evaluating the impact of prediction models.

Considerations in the Choice of Study Populations for Development and Validation

The big data era has seen an upsurge of prediction models developed using new data sources, including electronic health records (EHR) and administrative health and insurance claims databases. In perinatal epidemiology, these databases have been used to develop prediction models for the risk of early onset gestational diabetes [40], neonatal encephalopathy [41], neonatal sepsis [42], and adverse pregnancy outcomes [43•]. EHRs are digital versions of a patient’s medical chart containing medical and treatment history including laboratory and diagnostic test results, prescriptions, and hospital admissions. The breadth of clinical data in EHRs facilitates the sharing of clinical information across healthcare providers to improve continuity of care. Unlike EHRs, administrative health and insurance claims databases include data collected for administrative or billing purposes (e.g., Medicaid, Medicare, and Kaiser Permanente). The advantages of these databases are that they include a large number of patients followed longitudinally over time. Since these data are not collected for research purposes, their use for development and validation of risk prediction models is limited by the absence of detailed clinical information, inconsistencies in reporting, and discontinuous coverage resulting from changes in insurance providers or eligibility status. However, administrative and claims databases can be leveraged for research purposes through linkage to EHRs, disease registries, or census data.

The availability of more data and larger data sets affords an opportunity to identify novel predictors not previously considered or to include a larger set of predictors. However, the availability of new data sources and machine learning methods may also contribute to the surplus of unvalidated and poorly performing models. For example, there are approximately 1000 prognostic models developed to assess the risk of cardiovascular disease. However, only a limited number of these models have been externally validated, and even fewer are used for decision making in clinical practice [44].

Although EHRs and administrative and insurance databases allow for the inclusion of a larger set of candidate predictors, the data for predictors and outcomes may be less detailed, may be subject to measurement error or inconsistencies in reporting due to between-center or between-provider heterogeneity, and may have a large degree of missingness [45]. Although multiple imputation methods can circumvent issues of missing data (given an acceptable degree of missingness), they do not account for inconsistencies in reporting and data collection. For example, the reporting of spontaneous abortions may vary by time (primarily in earlier databases), by institution, and by healthcare provider in administrative databases, largely due to the passive nature of data collection. Moreover, private insurers often provide incentives to improve documentation of clinical and sociodemographic characteristics by healthcare providers, compared to administrative databases where documentation is left to the discretion of the provider. EHRs or administrative databases may also lack important predictors that are not routinely collected or recorded. A recent study by Dalton and colleagues showed that a summary measure for neighborhood deprivation outperformed traditional risk factors in the pooled cohort equations risk model for prediction of cardiovascular risk [46]. As articulated by Galea and Keyes, the study by Dalton and colleagues highlights the uncertainty in the accuracy of individual risk predictions based on a small set of clinical and demographic characteristics [47].

An additional consideration is the transportability of prediction models to different healthcare settings (e.g., socialized versus private healthcare and insurance claims versus population-based databases). For example, if we are interested in developing a risk prediction model to predict the occurrence of preeclampsia in low-resource settings, using an EHR from a tertiary care setting for the development of this model may not reflect the distribution of predictors or outcomes in the target population, thereby limiting its generalizability. An additional concern regarding the use of these databases is the potential for selection bias due to informative censoring. Differences in the case-mix of EHRs compared to the general population could result in substantial selection bias as a result of competing events or admissions to different hospitals. In perinatal epidemiology, however, losses to follow-up may be less of a concern since women tend to be younger and have fewer chronic illnesses. Although EHR and administrative and claims databases have become increasingly available, researchers need to consider the limitations of these data, their implications for the accuracy of individual-level predictions, and the potential harm to patients from miscalibrated models [13•].

Considerations for Selection of Predictors

The performance of prediction models is determined by the strength of the predictors included in the final model. The strength of a predictor is a function of both the magnitude of its association with the outcome and its distribution in the population [18]. However, additional considerations are needed to optimize the selection of predictors. First, a predictor should only be considered if it has a small degree of missingness. If the degree of missingness is acceptable (30–50%) [48], multiple imputation is preferred to minimize the potential selection bias that may occur when using complete cases only. Moreover, investigators need to ensure that predictors not routinely collected or not readily available at the time of risk assessment are excluded from the model, as their inclusion will reduce the generalizability of the model. For example, the usefulness of the gold standard for assessment of proteinuria, 24-h urine protein, versus a rapid dipstick for the management of women with gestational hypertension at > 37 weeks gestation is debatable due to the lag time associated with laboratory testing. Second, predictors need to be clearly defined using standardized and clinically relevant definitions [33]. Using arbitrary cut-offs or categories for predictors will reduce the model’s transportability to clinical settings. For example, if gestational diabetes is included as a candidate predictor and the threshold for diagnosis used in developing the model differs from the threshold used in clinical practice, this will impact the predictive performance of the model and its transportability into practice. In addition, researchers should be cautious of data-driven categorization of continuous predictors, since they may be fitting the idiosyncrasies of the data rather than true associations [18]. Third, the approach to data collection or capture needs to be considered, as this may impact the distribution of predictors in the population used for model development or the accuracy of the model on external validation. For example, a model developed to predict adverse obstetrical outcomes using a general practitioner’s database may not capture women at higher risk of experiencing the outcome, since these women are typically seen by obstetricians, thus impacting the generalizability of the model to all pregnant women. Fourth, the temporality of predictors is essential for predictor selection. Prediction models should only include patient characteristics available to clinicians at the time of risk assessment and not those that occur after the outcome. For example, although infant birth weight is a strong predictor of successful vaginal birth after a cesarean section, it should not be included as a predictor since it is not available prior to delivery. Fifth, predictors do not need to be causally related to the outcome. Candidate predictors should be chosen based on substantive and clinical knowledge and not based on their causal relationship with the outcome. For example, there is a lack of evidence to support a causal link between demographic and certain clinical characteristics and stillbirth. However, previous research suggests that socioeconomic status and smoking are strong predictors of stillbirth [49, 50•]. These risk factors should therefore be considered as candidate predictors when developing a prognostic model for stillbirth. Sixth, the predictive value of predictors should not be assessed using measures of association (e.g., odds ratios (OR), risk ratios, and risk differences) [51]. As demonstrated in simulations, predictors would need to have associations as large as OR > 25 to be deemed strong predictors on the basis of such measures [51]. Researchers should therefore avoid univariate analyses for the selection of predictors and rely on more relevant measures such as the discriminatory ability of models. Finally, investigators tend to measure more predictors than can reasonably be included in the model. For prediction modeling, the number of predictors should be determined by the number of outcome events. To minimize the risk of overfitting or overly optimistic models (with higher than expected false positive rates), the convention is to use the 10:1 rule (a ratio of ten events per predictor) to improve model accuracy [39]. However, more recent work suggests that the 10:1 rule may be too conservative and that the number of predictors should instead be based on the prevalence of the outcome, the total sample size, and the number of events in the population used for model development [52].
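
As a concrete illustration of the events-per-variable constraint, the following minimal Python sketch (with purely hypothetical counts) shows how the conventional 10:1 rule caps the number of parameters a model can support:

```python
# Hypothetical counts for illustration only.
n_events = 120                            # e.g., observed cases in the cohort
epv_target = 10                           # conventional 10 events per parameter
max_parameters = n_events // epv_target   # at most 12 estimated parameters

# Each spline term or indicator variable consumes one parameter, so a
# single categorical predictor with k levels uses k - 1 of these slots.
# More recent criteria (e.g., Riley et al. [52]) instead derive the limit
# from outcome prevalence, total sample size, and the number of events.
print(f"Parameters supported under a {epv_target}:1 rule:", max_parameters)
```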

Generalizability of Prediction Models

External validation is essential for the implementation of prediction models in clinical practice. However, it is rarely performed as a result of the limited availability of suitable data [2, 13•], and when performed, prediction models rarely perform well in external validation, mainly due to study-level or population-level differences with the development dataset [12, 18]. These differences can occur in part due to differences in study design, which can lead to differences in the incidence of outcomes as a consequence of the sampling strategy (e.g., case-control versus cohort) or the mechanism of data collection (e.g., self-report versus physician diagnosis). Differences in the incidence of the outcome between the development and the validation set can reduce the transportability of the model, largely due to poor discrimination [13•]. The distribution of predictors may also differ as a result of variations in the case-mix. For example, using a disease registry may yield a more severe case-mix of patients compared to a primary care or population-based cohort. The accuracy of prediction models in validation sets may also decrease as a function of temporal trends in patient characteristics and outcome distributions. To accommodate such temporal changes and to avoid inappropriately rejecting a potentially useful prediction model, investigators can recalibrate or update models based on population-level differences in the validation set [18, 33]. Discrepancies may also result from differences in standard of care across jurisdictions or availability of resources (e.g., tertiary versus primary care settings or rural versus urban settings). Variations in the strength of predictors can result from overfitting of models or from variations in the definition of predictors and outcomes. This can be minimized through the use of standardized definitions for predictors and outcomes and transparent reporting as described in the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines [53].
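
As an illustration of such model updating, a simple logistic recalibration in a validation sample re-estimates the intercept (calibration-in-the-large) and the calibration slope from the original model’s linear predictor. The sketch below uses simulated data and the statsmodels package; all names and values are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

# Simulated validation data: observed outcomes and risks from the original
# model; a slope below 1 in the refit mimics an overfitted original model.
rng = np.random.default_rng(3)
lp_true = rng.normal(0, 1, 500)
y_val = rng.binomial(1, 1 / (1 + np.exp(-lp_true)))
risk_val = 1 / (1 + np.exp(-1.5 * lp_true))   # overconfident predictions

lp = np.log(risk_val / (1 - risk_val))        # original linear predictor

# Calibration-in-the-large: re-estimate only the intercept by fitting an
# intercept-only logistic model with lp entering as a fixed offset.
intercept_fit = sm.GLM(y_val, np.ones((len(lp), 1)),
                       family=sm.families.Binomial(), offset=lp).fit()

# Logistic recalibration: re-estimate intercept and calibration slope.
slope_fit = sm.GLM(y_val, sm.add_constant(lp),
                   family=sm.families.Binomial()).fit()
alpha, beta = slope_fit.params                # beta < 1 flags overfitting
risk_updated = 1 / (1 + np.exp(-(alpha + beta * lp)))
```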

Conclusions

Risk prediction modeling provides clinicians with an objective measure of an individual’s absolute risk to guide treatment and prevention strategies. The increasing availability of prediction models developed to predict outcomes during pregnancy and delivery and in the post-partum period highlights the importance of targeting high-risk patients for prevention strategies [54]. However, the utility of risk prediction models in perinatal epidemiology is contingent on the use of appropriate strategies for model development and validation, transparency in the reporting of results, and assessment of clinical impact. Additionally, data linkage and data quality need to be optimized in order to facilitate the use of EHRs and administrative and claims databases for the development and validation of prediction models and to improve the transportability of models across clinical settings and geographic locations. Population-based pregnancy registries linked to various databases, including information from obstetrical visits (including genetic screening, ultrasound, and diagnostic tests), the antepartum and delivery period (including maternal and infant outcomes), neonatal outcomes, past clinical history, and vital statistics, should be prioritized for the development and validation of prediction models in perinatal epidemiology. Birth and perinatal registries such as those available in Denmark (Danish Medical Birth Register), Norway (Medical Birth Registry of Norway), Finland (Medical Birth Registry), Canada (British Columbia Perinatal Data Registry), and the UK (Clinical Practice Research Datalink Pregnancy Registry) are examples of databases that could be exploited for risk prediction in perinatal epidemiology due to the large number of individuals they include, their longitudinal follow-up, and their representativeness of the general population. However, the quality of the data and their linkage to external databases (as previously described) need to be optimized in order to reduce the potential for measurement error and missing data and to improve the accuracy and generalizability of prediction models.

Future research in risk prediction modeling in perinatal epidemiology should focus on updating existing models and adjusting or recalibrating them to local circumstances or settings rather than developing new models. In this way, prediction models may strengthen evidence-based, individualized decision making and contribute to the rational use of scarce resources. When new prediction models are needed, considerations regarding the clinical setting and the outcomes of greatest importance should be prioritized to increase their transportability to the target population. Despite the challenges of implementing prediction models in clinical practice, such models are useful for improving our understanding of how risk factors contribute to the burden of disease and for identifying women and infants who would benefit from available treatments.