Highlights

  • Approximately 15% of acute coronary syndrome patients experience a recurrent event within 1 year.

  • Identification of those who would benefit from intensified antithrombotic strategies is important.

  • This analysis was the first to evaluate the performance of machine learning algorithms for prediction of major adverse cardiovascular and bleeding events in acute coronary syndrome patients.

  • Compared to the TIMI risk score and a surrogate risk score, developed in this dataset, machine learning modestly improved the c-statistic for MACE.

  • Importantly, the machine learning model demonstrated remarkable calibration on both efficacy and safety outcomes.

  • This analysis demonstrates a contemporary application of machine learning models to assist in clinical decision making by concurrently stratifying bleeding and thrombotic risk.

Introduction

Approximately 15% of acute coronary syndrome (ACS) patients experience recurrent cardiovascular (CV) events within 1 year [1,2,3]. Individualized patient-level prediction of major adverse cardiovascular events (MACE) and bleeding events among patients with ACS may help to identify those who would benefit from intensified antithrombotic strategies and, ultimately, to tailor a personalized approach [4,5,6]. Traditional statistical methods allow inferences to be made regarding populations, but are less powerful than machine learning in making predictions regarding individual patients. Traditional risk stratification scores are built using parametric and semi-parametric regression models that have several limitations, including primary reliance on linear models and limited capacity to explore higher-order interactions [7, 8]. These limitations are most pronounced among patients with extreme risk profiles, because the underlying parametric assumption extrapolates their risk from population means. As a result, traditional risk prediction scores perform sub-optimally.

In contrast, machine learning explores large datasets and uses algorithms that can learn from and make predictions on data. Additionally, these models have built-in functionality for variable selection and do not require pre-specification of interaction terms. Over the past decade, machine learning techniques have made substantial advances in many domains, including health care [9, 10]. Recent evidence suggests that machine learning methods may offer a powerful alternative to conventional methods for risk prediction. However, both the accuracy and the application of machine learning models to predict clinical outcomes in individual ACS patients remain unknown. The aim of this analysis was to evaluate the performance of machine learning models in predicting the occurrence of MACE and bleeding events among ACS patients, compared to traditional risk stratification methods, and to explore contemporary applications of machine learning models in individualized cardiovascular risk prediction.

Methods

Source of data

Data from 24,178 patients who received at least one dose of study drug were pooled from four large randomized clinical trials. The ATLAS ACS-TIMI 46 and 51 trials enrolled 3491 and 15,526 adult patients with ACS, respectively. Patients were randomized to receive rivaroxaban or placebo in combination with either aspirin alone or dual antiplatelet therapy (DAPT) consisting of aspirin plus a P2Y12 inhibitor. The GEMINI ACS trial enrolled 3037 patients with ACS who were on a P2Y12 inhibitor and randomized them to receive either rivaroxaban or aspirin. Finally, the PIONEER AF-PCI trial enrolled 2124 patients with atrial fibrillation who underwent PCI. Patients were randomized to receive either dual or triple rivaroxaban-based antithrombotic regimens or Vitamin K Antagonist (VKA)-based triple therapy. The study designs and primary results for each of the four randomized clinical trials have been previously published [11,12,13,14].

Outcomes

The efficacy outcome was the composite of cardiovascular (CV) death, myocardial infarction (MI), or stroke within 6 months of randomization. The safety outcome was the composite of TIMI (thrombolysis in myocardial infarction) major and minor bleeding and bleeding requiring medical attention within 6 months of trial randomization.

Predictive models

Ensemble learning is a type of machine learning that combines predictions across different candidate models. The Super Learner ensemble method uses cross validation to select weights applied to each candidate model. A total of 23 machine learning algorithms were built using tenfold cross validation. Broadly, the families of candidate models built included Generalized Additive Models (GAMs), Elastic Net (Penalized Logistic Regression), Gradient Boosted Machines (GBMs), Random Forests, a Bayesian logistic regression with default priors, and a naïve Bayes classification model. A total of 48 variables were used for model building (Supplemental Table S1).
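The weighting step described above can be illustrated with a minimal sketch. It assumes that out-of-fold (cross-validated) predictions from each candidate model are already available, and it chooses non-negative weights summing to one by a coarse grid search on the Brier score. The loss function, the grid search, and all data below are illustrative assumptions, not the procedure or data used in this analysis.

```python
from itertools import product

def brier(y, p):
    """Mean squared error between binary outcomes and predicted risks."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def blend(preds, weights):
    """Weighted average of the candidate predictions for each patient."""
    n = len(preds[0])
    return [sum(w * m[i] for w, m in zip(weights, preds)) for i in range(n)]

def simplex_grid(k, steps):
    """Yield every length-k weight vector on a grid that is non-negative and sums to 1."""
    for combo in product(range(steps + 1), repeat=k):
        if sum(combo) == steps:
            yield [c / steps for c in combo]

def super_learner_weights(y, cv_preds, steps=20):
    """Pick the convex weights that minimise the cross-validated Brier score."""
    return min(simplex_grid(len(cv_preds), steps),
               key=lambda w: brier(y, blend(cv_preds, w)))

# Hypothetical outcomes and out-of-fold predictions from three candidate models
y = [0, 0, 1, 0, 1, 1, 0, 0]
cv_preds = [
    [0.1, 0.2, 0.7, 0.3, 0.6, 0.8, 0.2, 0.1],  # e.g. a boosted model
    [0.3, 0.3, 0.5, 0.4, 0.5, 0.6, 0.3, 0.3],  # e.g. a penalised regression
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # an uninformative candidate
]

weights = super_learner_weights(y, cv_preds)
ensemble = blend(cv_preds, weights)
print(weights, brier(y, ensemble))
```

Because the grid contains the weight vectors that put all weight on a single candidate, the blended predictions in this sketch can never have a worse cross-validated Brier score than the best individual candidate.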

For the MACE endpoint, the Super Learner ensemble was compared to three traditional risk stratification tools: (1) the TIMI risk score, (2) a surrogate risk score derived from the dataset using traditional statistical methods, and (3) a stepwise logistic regression. The surrogate model was created by first randomly splitting the data 50/50 into training and test sets, then fitting a logistic regression model to 14 variables thought, a priori, to have the most predictive importance. For the bleeding endpoint, the Super Learner ensemble was compared to a stepwise logistic regression. The detailed statistical methods of these comparator models are described in the supplemental appendix.

Performance measures

Predictive performance was measured in two ways: (1) the ability of the model to discriminate between outcome classes, and (2) the accuracy of the method's probabilistic predictions, called calibration. Discrimination was assessed with a cross-validated concordance statistic (c-statistic). C-statistic comparisons were based on a bootstrapped test of significance. Calibration represents the reliability of a model by assessing how closely the predicted risk estimate for a particular patient corresponds to the observed event rate among patients with that predicted risk. In this analysis, calibration was assessed via high-resolution non-parametric calibration plots. In the calibration plots, the diagonal line represents perfect calibration, with predicted estimates matching observed event rates. Deviations above the diagonal line represent a model that underestimates risk and deviations below the diagonal line represent a model that overestimates risk. The Hosmer–Lemeshow goodness of fit test was used to test for a statistically significant difference between the model and the perfect calibration (diagonal) line. A high p-value on this test is favorable and represents no significant difference from perfect calibration, whereas a low p-value represents a significant difference from perfect calibration.
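The three quantities named above can be sketched in a few lines. This is a hedged illustration under stated assumptions: the c-statistic is computed by direct pair counting, the bootstrap comparison uses a simple paired percentile approach (the exact bootstrap procedure used in this analysis is not specified here), and the Hosmer–Lemeshow function returns only the chi-square statistic, whose p-value would conventionally be read from a chi-square distribution with (groups − 2) degrees of freedom. All data are invented.

```python
import random

def c_statistic(y, p):
    """Share of event/non-event pairs in which the event patient has the higher risk (ties count 0.5)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(events) * len(nonevents))

def bootstrap_c_difference(y, p_a, p_b, n_boot=1000, seed=0):
    """Paired bootstrap of c(A) - c(B); returns the observed difference and a two-sided p-value."""
    rng = random.Random(seed)
    n = len(y)
    observed = c_statistic(y, p_a) - c_statistic(y, p_b)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if 0 < sum(yb) < n:  # keep only resamples containing both outcome classes
            diffs.append(c_statistic(yb, [p_a[i] for i in idx])
                         - c_statistic(yb, [p_b[i] for i in idx]))
    below = sum(d <= 0 for d in diffs) / n_boot
    above = sum(d >= 0 for d in diffs) / n_boot
    return observed, min(1.0, 2 * min(below, above))

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic over risk-ordered groups of patients."""
    ordered = sorted(zip(p, y))
    n = len(ordered)
    stat = 0.0
    for g in range(groups):
        chunk = ordered[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(pi for pi, _ in chunk)   # predicted number of events
        observed = sum(yi for _, yi in chunk)   # observed number of events
        mean_p = expected / len(chunk)
        if 0 < mean_p < 1:
            stat += (observed - expected) ** 2 / (len(chunk) * mean_p * (1 - mean_p))
    return stat

# Hypothetical outcomes and predictions from two models
y = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
p_a = [0.05, 0.10, 0.15, 0.20, 0.60, 0.25, 0.70, 0.80, 0.30, 0.90]
p_b = [0.30, 0.20, 0.40, 0.10, 0.35, 0.50, 0.20, 0.60, 0.30, 0.25]

diff, p_value = bootstrap_c_difference(y, p_a, p_b, n_boot=200)
print(c_statistic(y, p_a), c_statistic(y, p_b), diff, p_value)
print(hosmer_lemeshow(y, p_a, groups=5))
```

A large Hosmer–Lemeshow statistic corresponds to a low p-value, that is, to predicted event counts that differ markedly from observed counts within the risk groups.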

Individualized risk predictions

The secondary objective of this study was to explore the ability of the super learner ensemble to produce clinically relevant individual patient risk predictions. This was done by randomly selecting 3000 patients who had both efficacy and safety outcome assessments, and computing risk estimates according to their antithrombotic regimen. The procedure produces four risk estimates for each patient: (1) the predicted probability of a MACE event on rivaroxaban and (2) on the study-specific control regimen, and (3) the predicted probability of a bleeding event on rivaroxaban and (4) on the study-specific control regimen. The patient-specific predicted risk of an event on rivaroxaban was plotted against the predicted risk of an event on the study-specific control (Supplemental Fig. S1).
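The four estimates per patient can be thought of as querying each fitted model twice, once with the treatment indicator set to rivaroxaban and once set to control, while holding all other covariates fixed. The sketch below illustrates that idea only. The two prediction functions are hypothetical stand-ins for the fitted ensembles, and their variables and coefficients are invented for illustration rather than estimated from the trial data.

```python
import math

def logistic(z):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical stand-ins for the fitted MACE and bleeding models.
# Coefficients are invented for illustration, not estimates from the trials.
def predict_mace(patient):
    z = (-3.5 + 0.03 * (patient["age"] - 60)
         + 0.6 * patient["prior_mi"] - 0.25 * patient["rivaroxaban"])
    return logistic(z)

def predict_bleed(patient):
    z = -3.0 + 0.02 * (patient["age"] - 60) + 0.5 * patient["rivaroxaban"]
    return logistic(z)

def four_risk_estimates(patient):
    """Predict each outcome under both regimens, leaving the input record unchanged."""
    on_drug = dict(patient, rivaroxaban=1)
    on_control = dict(patient, rivaroxaban=0)
    return {
        "mace_rivaroxaban": predict_mace(on_drug),
        "mace_control": predict_mace(on_control),
        "bleed_rivaroxaban": predict_bleed(on_drug),
        "bleed_control": predict_bleed(on_control),
    }

patient = {"age": 68, "prior_mi": 1, "rivaroxaban": 0}
risks = four_risk_estimates(patient)
for name, value in risks.items():
    print(f"{name}: {value:.3f}")
```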

To assess benefit-risk, a two-dimensional plot was derived by calculating, for each of the 3000 randomly selected patients, the difference between the predicted risk in the control group and the predicted risk in the treatment group for both MACE and bleeding. The plot displays the difference in MACE risk estimates on the Y-axis and the difference in bleed risk estimates on the X-axis (Supplemental Fig. S2).
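The coordinates of a single patient on such a plot follow directly from the four risk estimates. The sketch below uses the sign convention stated in the Fig. 4 legend (control minus rivaroxaban, so positive values favor rivaroxaban). The function names, dictionary keys, and risk values are hypothetical and chosen only for illustration.

```python
def benefit_risk_point(risks):
    """Return (x, y): control-minus-rivaroxaban risk differences for bleeding and MACE."""
    x = risks["bleed_control"] - risks["bleed_rivaroxaban"]
    y = risks["mace_control"] - risks["mace_rivaroxaban"]
    return x, y

def describe(x, y):
    """Label the quadrant of the benefit-risk plot in which a patient falls."""
    mace = "lower" if y > 0 else "not lower"
    bleed = "lower" if x > 0 else "not lower"
    return f"MACE risk {mace}, bleeding risk {bleed} on rivaroxaban"

# Hypothetical predicted risks for one patient
risks = {
    "mace_control": 0.060,
    "mace_rivaroxaban": 0.045,
    "bleed_control": 0.050,
    "bleed_rivaroxaban": 0.080,
}

x, y = benefit_risk_point(risks)
print(describe(x, y))
```

For this invented patient the point lies in the upper-left quadrant: a predicted reduction in MACE risk accompanied by a predicted increase in bleeding risk.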

Results

Baseline characteristics

Of the 24,178 pooled patients, 22,955 had both ACS and an efficacy outcome assessment and were included in the efficacy dataset. Similarly, 22,936 had both ACS and a safety outcome assessment and were included in the safety dataset.

Baseline characteristics and outcome summary of the patients in the pooled dataset are shown in Table 1. Overall, the mean age was 61.7 years, 49.2% of patients had STEMI, 64.2% of patients were randomized to receive a rivaroxaban-based regimen and 35.8% to a control. Approximately 66% of patients underwent PCI for the index event, 4.2% experienced a MACE event, and 7.5% experienced a TIMI major or minor bleeding event, or bleeding requiring medical attention.

Table 1 Baseline characteristics

Performance measures

The super learner demonstrated the best discriminative ability for both outcomes achieving a c-statistic of 0.734 for MACE and 0.670 for bleeding (Fig. 1). The best performing candidate model varied according to the outcome. For MACE, the best performing candidate model was the GBM, which achieved a c-statistic of 0.714. For the safety outcome, the best performing candidate model was the Elastic Net, which achieved a c-statistic of 0.669.

Fig. 1

Receiver operating characteristic curves. a Shows the receiver operating characteristic curve for the MACE outcome for the super learner ensemble, the best candidate model (GBM), the stepwise logistic regression, the surrogate risk score, and the TIMI risk score. b Shows the receiver operating characteristic curve for the bleeding outcome for the super learner ensemble, the best candidate model (Elastic Net), and the stepwise logistic regression

The MACE outcome super learner performed significantly better than the TIMI risk score (c-statistic 0.734 vs. 0.489, p < 0.001), the best performing candidate model GBM (c-statistic 0.734 vs. 0.714, p < 0.001), and the surrogate risk score (c-statistic 0.734 vs. 0.644, p < 0.001). The super learner performed similarly to the stepwise logistic regression (0.734 vs. 0.714, p = 0.076). The safety outcome super learner performed similarly to the best candidate model (0.670 vs. 0.669, p = 0.611) and the stepwise regression model (0.670 vs. 0.671, p = 0.946).

Calibration plots suggest that the super learner ensemble demonstrates good calibration for both outcomes (Fig. 2). For the MACE outcome, the super learner calibration plot is close to the perfect calibration line for risk predictions between 0 and 0.3. The Hosmer–Lemeshow test failed to reject good calibration for the super learner (p = 0.612). In contrast, it rejected good calibration for the best performing candidate model (GBM) (p < 0.001), the TIMI risk score (p < 0.001), and stepwise regression model (p < 0.001).

Fig. 2

Calibration plots. a Shows the calibration plot for the MACE outcome for the super learner ensemble, the best candidate model (GBM), the stepwise logistic regression, the surrogate risk score, and the TIMI risk score. b Shows the calibration plot for the bleeding outcome for the super learner ensemble, the best candidate model (Elastic Net), and the stepwise logistic regression. In both panels, the 45° diagonal line represents perfect calibration between predicted risk estimates and observed risk

Inspection of the calibration plots for the safety outcome leads to similar conclusions. The super learner demonstrated excellent calibration for predicted risks between 0 and 0.25. Visually, the best performing candidate model and the stepwise regression were less well calibrated than the super learner ensemble. Formally, the Hosmer–Lemeshow test failed to reject good calibration for the super learner (p = 0.970), the best performing candidate model (Elastic Net) (p = 0.993), and the stepwise regression (p = 0.088).

Individualized risk predictions

The super learner-derived MACE risk estimates on rivaroxaban were plotted against those on control (Fig. 3a). The model predicted that approximately 81% of patients fall below the diagonal line and would have reduced MACE risk on rivaroxaban. Approximately 5% (N = 135) of the 3000 randomly selected patients were below the diagonal line for the bleeding outcome and are predicted to have decreased risk of bleeding with rivaroxaban, compared to the control (Fig. 3b). The combined benefit-risk plot conveys the patient-level risk prediction results in a single plot. Using this method, individual differences in treatment benefit or harm may be discerned (Fig. 4).

Fig. 3

Individualized predicted risk plot. The points on the predicted risk plot represent a patient’s risk profile for a particular outcome. The diagonal line represents equal risk of the outcome with and without rivaroxaban treatment. Patients below the 45° line are predicted to have lower risk on rivaroxaban versus control. Patients above the 45° line are predicted to have higher risk on rivaroxaban versus control

Fig. 4

Individualized benefit-risk plot. The points on the plot represent a patient’s individual predicted benefit-risk profile, based on a combination of that patient’s characteristics. A positive value on the Y-axis represents reduced MACE risk with rivaroxaban treatment and a positive value on the X-axis represents reduced risk of bleed on rivaroxaban, compared to the control arm

Discussion

This analysis is the first report of a machine learning model to predict MACE and bleeding outcomes among patients with ACS enrolled in randomized controlled trials. The super learner ensemble method demonstrated improved performance compared to traditional risk stratification methods. The super learner model improved the c-statistic for predicting ischemic risk compared to the TIMI risk score and a surrogate risk score derived from the dataset, but showed discrimination similar to that of the stepwise logistic regression. For bleeding risk, it demonstrated a c-statistic similar to that of the logistic regression model. The machine learning model produced risk estimates that were highly calibrated with observed efficacy and bleeding outcomes. The calibration of the machine learning model exceeded that of logistic regression and risk scores. This analysis additionally explored a new application of the super learner method to assist the treatment decision making process.

On an individual level, patients and physicians are primarily concerned about the accuracy of a prognostic estimate after an ACS event, and not merely about the overall discrimination of outcomes. Thus, beyond the c-statistic, calibration is a key component of risk stratification tools. Indeed, the predicted risk in a poorly calibrated model may over- or underestimate the observed event rate. In contrast, in a well-calibrated model, a 5% predicted risk of an event corresponds to an observed 5% event rate. Therefore, the most significant advance in this analysis was the high calibration of the super learner model as compared to logistic regression and conventional risk score techniques.

Previous studies have demonstrated inconsistent results regarding the performance of machine learning algorithms as compared to regression models [15,16,17,18,19,20,21,22,23]. In the current analysis, there was a dissociation in the performance of the super learner method in the efficacy endpoint versus the safety endpoint. There may be several reasons for this inconsistency. First, there are numerous types of machine learning models that may fit and perform differently in different datasets. Even among the same types of models, there are countless combinations of tuning parameters that may influence model performance. However, the super learner method provides the advantage of combining any number of models to arrive at the best combined estimate that performs at least as well as the best candidate model. Second, one of the components of the bleeding outcome (bleeding requiring medical attention) is a sensitive endpoint but is non-specific and thus, may be more difficult to predict. Moreover, additional variables that are not captured in the dataset may be associated with the occurrence of an outcome (e.g. genetic or environmental factors) and could improve model performance.

One promise of precision medicine is to identify patients most likely to benefit from treatment and to withhold treatment from those in whom treatment is more likely to cause harm. Notably, the current analysis demonstrates a novel application of machine learning to assist in clinical decision making. Prior traditional analysis methods estimate mean treatment benefits of antithrombotic regimens across populations enrolled in clinical trials. Even if overall classification rates remain similar, this one-size-fits-all approach does not allow for identification of patients more likely to derive larger relative risk reductions from therapy. The machine learning methods employed here incorporate non-linear mappings from exposures to risk such that a differential benefit of rivaroxaban in ACS patients may be identified. The benefit-risk plots of different antithrombotic regimens may be easily visualized and understood by both physicians and patients to facilitate shared decision making. An application for handheld devices can allow real time calculation and display of these results.

A common problem with machine learning models is difficulty in incorporating prior scientific knowledge. It is also difficult to understand exactly how the super learner makes use of the variables to arrive at predictions due to its “black box” nature [24, 25]. In the absence of causal pathways linking exposure to outcomes, the role of machine learning algorithms should not supersede clinical judgement, but rather serve as a tool to guide clinical decisions. Machine learning algorithms may supplement physician decision making by accounting for interactions among variables that clinicians may not be aware of.

Interest in the potential for machine learning in healthcare has recently increased [26]. There have been suggestions that machine learning will drive changes in patient care within a few years, specifically in clinical settings that rely on the accuracy of prognostic models and those based on pattern recognition [9, 27]. For example, deep learning algorithms demonstrated high accuracy in detecting diabetic retinopathy [28], malignant melanoma [29], and in predicting mortality in patients admitted to the ICU [30]. Personalized benefit-risk estimates are one viable application of machine learning algorithms in cardiovascular medicine. Machine learning models may also be used for the enrichment of clinical trials with high-risk individuals that may benefit from a particular investigational therapy, magnifying the expected effect size, increasing power and thus, reducing the sample size.

Limitations

First, the TIMI score was designed to predict short-term mortality and was only available for a subset of patients. However, it is unlikely that the performance estimate is biased enough to account for the approximately 20-point difference in c-statistic values. The Killip class variable, required to calculate the GRACE score (which is validated for long-term ischemic outcomes), was not available in this dataset. Second, clinical trials are not representative samples of patient populations, possibly limiting the generalizability of the model. Therefore, the model needs to be evaluated in an external data set. Third, the clinical trials evaluated different dosing regimens of rivaroxaban, different control arms, and enrolled ACS patients with and without atrial fibrillation. The differences in enrollment criteria between the included studies introduce population heterogeneity and are associated with different baseline MACE rates. Although in principle the inclusion of diagnostic variables like atrial fibrillation could account for this difference, in practice heterogeneous populations often confound prediction tools. More accuracy could possibly be obtained by fitting the model on a subset of patients, though loss of power from decreasing the size of the training data could offset the benefit of having a more homogeneous population. Furthermore, the studies employed different lengths of follow-up, which could also bias our estimator. Further refinement of the model is needed to provide dose- and control-specific prediction estimates. Fourth, the super learner ensemble was not compared to a validated bleeding risk score such as the CRUSADE score, as the variables required to calculate it were not available in the dataset.

Finally, the super learner model was highly calibrated on MACE risk estimates from 0 to 0.3 but predicted few patients (n = 15) with a risk estimate above 0.3, and consequently over-fits the data in this range. Similarly, for the bleeding outcome, a relatively small proportion of patients had a predicted risk above 0.25 (n = 78). However, for both bleeding and efficacy outcomes, patients with a probability above 0.25 are considered at extremely high risk and would warrant maximal medical therapy (for MACE) and caution (for bleeding). Thus, despite loss of calibration in this extremely high-risk range, the model demonstrates excellent calibration for the most clinically relevant range, in which the nuances of individual patient characteristics need to be discerned for appropriate clinical decision making.

Conclusion

This analysis is the first to evaluate the performance of machine learning algorithms, built on pooled randomized clinical trial data, for the prediction of MACE and bleeding outcomes among patients with ACS. The super learner produced the highest c-statistic for prediction of MACE compared to traditional risk stratification methods including the TIMI risk score, a surrogate risk score derived from the dataset and a stepwise logistic regression. Importantly, the super learner ensemble method demonstrated remarkable calibration on both efficacy and safety outcomes which greatly exceeded that of traditional logistic regression. This analysis also displayed a contemporary application of machine learning models to assist in clinical decision making based on easily interpreted plots of robust individualized predicted risks of efficacy and safety events in coronary artery disease patients on antithrombotic therapy.