Introduction

With a median age at diagnosis of almost 70 years, acute myeloid leukemia (AML) primarily affects older adults [1, 2]. It is well recognized that advanced age and/or existing comorbid illnesses may limit the tolerance of intensive chemotherapy and resulting pancytopenia [3, 4]. Although our tools to support patients throughout this treatment period have improved over the last 20 years, there is ongoing concern that perceived risks of such therapies may not be commensurate with anticipated benefits in medically less fit individuals [5,6,7]. Consequently, there is interest in accurately assessing the ability of individual patients to tolerate intensive AML chemotherapy—an interest that has greatly increased with recent market approval of several new drugs that have expanded the number of lower intensity therapies for patients with AML [8,9,10]. With increasing treatment options available, evaluating both disease characteristics and the patient’s general medical condition is fundamental to guide informed decision-making regarding suitability for intensive therapies at the individual patient level.

Unfortunately, assessing a patient’s suitability (“fitness”) for intensive chemotherapy has remained a challenge, especially in older patients, those with altered general status, and/or those with comorbidities [11, 12], with no current consensus on how to approach this evaluation. Several scoring systems have been developed over the last decade to predict shorter- or longer-term outcomes in individual patients after receipt of intensive AML chemotherapy to aid this process [12,13,14,15,16]. Some of these scores rely primarily on previously established comorbidity scores with the incorporation of additional parameters related to AML disease characteristics (e.g., Augmented Hematopoietic Cell Transplantation Comorbidity Index [HCT-CI] or AML Composite Model [AML-CM]) while others have been specifically developed to predict early mortality (e.g., Treatment-Related Mortality [TRM] score or MD Anderson Cancer Center [MDACC] scoring system) or rely on consensus from expert panels (e.g., SIE/SIES/GITMO Consensus Criteria or Ferrara score for unfitness) [12,13,14, 16]. Moreover, recently conducted large controlled trials targeting adults with AML deemed unsuitable for intensive AML chemotherapy have put forth their own criteria for unfitness adapted from the Ferrara criteria [9, 10].

While these different approaches have shown some value in identifying subsets of patients at increased risk of early mortality and/or shorter survival in the cohorts in which they were derived, it is unknown how they perform relative to each other in other patient cohorts. To address this limitation, we sought to examine the predictive value of several of the most widely used approaches to estimate treatment fitness in an independent multicenter cohort of adults with AML receiving intensive induction therapy.

Patients and methods

Study cohort

For our retrospective analysis, we identified all consecutive adults ≥18 years of age with newly diagnosed AML (WHO 2016) admitted to five university teaching hospitals during a 3-year period (2017–2019) provided they received induction chemotherapy with 3 + 7, idarubicin-cytarabine-lomustine (ICL), or CPX-351 [17, 18]. To ensure complete case capture, patients were identified by consulting each hospital pharmacy’s registry of induction chemotherapy prescriptions. Patients with secondary AML after antecedent myelodysplastic syndrome (MDS) or myeloproliferative neoplasm (MPN), or after exposure to a leukemogenic agent (previous chemotherapy or radiotherapy) were included. We excluded patients with acute promyelocytic leukemia (APL) or mixed-phenotype acute leukemia, those who previously underwent allogeneic hematopoietic cell transplantation (HCT), and those who received any treatment directed at AML before induction chemotherapy, such as a hypomethylating agent (HMA) or low-dose cytarabine (LDAC). This retrospective study was approved by the Institutional Ethics Committee of the University Hospital of Angers (# 2020/63).

Data collection

We conducted a comprehensive chart review to obtain information on demographics, comorbidities, and cytogenetic/molecular abnormalities. Cytogenetic risk was classified using the refined NCRI/MRC criteria [19]. Cytogenetic/molecular risk was assigned according to the European LeukemiaNet (ELN) 2017 risk classification [20]. Since many patients did not undergo next-generation sequencing (NGS) testing at diagnosis, ASXL1, RUNX1, and TP53 mutational status was unknown in most patients. The augmented HCT-CI, the AML-CM (Supplementary Table 1), the TRM score, the Ferrara assessment for fitness (Supplementary Table 2), and the recent early mortality score developed by investigators at MDACC (Supplementary Table 3) were calculated as previously described with parameters collected at diagnosis [12,13,14,15,16, 21]. Pulmonary function testing is not routinely performed in our centers and was only available for 10 patients. Also, uricemia was not available in our database and was therefore not used in the calculation of the MDACC scoring system. The criteria used to define unfitness for intensive chemotherapy in two recent randomized clinical trials (RCTs) targeting medically less fit patients were also assessed (“Unfitness RCT eligibility criteria”; Supplementary Table 4) [9, 10].

Endpoints

Twenty-eight-day and 56-day mortality were used to denote early mortality. Event-free survival (EFS) was measured from the first day of induction therapy to the date of primary refractory disease, relapse from complete remission (CR)/CR with incomplete hematological recovery (CRi) [20], or death from any cause, whereas overall survival (OS) was measured from the first day of induction chemotherapy until death from any cause, with observations censored for patients last known alive [20]. Primary refractory disease was defined as absence of CR/CRi following 2 cycles of induction chemotherapy [20].
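The endpoint definitions above can be illustrated with a minimal Python sketch that derives the early-mortality flags and the OS time and event indicator for one patient. The function name, the field names, and the convention that "28-day mortality" means death within 28 days of the first day of induction are assumptions made for illustration; they are not taken from the study database or the authors' code.

```python
from datetime import date

def derive_endpoints(induction_start, last_follow_up, death_date=None):
    """Return early-mortality flags and OS time/event for one patient.

    Hypothetical helper: names and the day-counting convention are
    illustrative assumptions, not the authors' implementation.
    """
    if death_date is not None:
        os_days = (death_date - induction_start).days
        os_event = 1  # death from any cause
    else:
        os_days = (last_follow_up - induction_start).days
        os_event = 0  # censored when last known alive
    return {
        "death_28d": os_event == 1 and os_days <= 28,
        "death_56d": os_event == 1 and os_days <= 56,
        "os_days": os_days,
        "os_event": os_event,
    }

# Invented example patients (dates are not study data).
early = derive_endpoints(date(2018, 3, 1), date(2018, 4, 15),
                         death_date=date(2018, 4, 15))
alive = derive_endpoints(date(2018, 3, 1), date(2019, 3, 1))
```

In this sketch a patient who dies on day 45 counts toward 56-day but not 28-day mortality, while a patient last seen alive contributes a censored OS observation.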

Statistical analysis

Patient characteristics are reported as number and percentage for qualitative variables and as median [interquartile range (IQR)] for continuous variables. Continuous variables were compared with the unpaired Student’s t-test or the Mann–Whitney U test, and categorical variables with the chi-squared test or Fisher’s exact test, as appropriate. Associations between variables and 28-day and 56-day mortality were assessed using logistic regression. EFS and OS were estimated using the Kaplan–Meier method and compared with the log-rank test; associations with EFS and OS were assessed using Cox regression. C-statistics, time-dependent AUCs, and Brier scores were used to assess predictive ability, with a graphical approach for time-dependent AUCs and Brier scores over time. An AUC/C-statistic of 1.0 indicates that a model (or covariate) is perfect at prediction, whereas 0.8–0.9 is considered good, 0.7–0.8 is considered fair, 0.6–0.7 is considered poor, 0.5–0.6 is considered very poor, and 0.5 indicates no prediction. All tests were performed with a type I error set at 0.05. Statistical analyses were performed with R (R Foundation for Statistical Computing, Vienna, Austria; http://www.r-project.org). This study follows the “Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis” (TRIPOD) guidelines [22].
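To make the accuracy measures concrete, the following Python sketch computes a C-statistic for a binary outcome (such as 28-day mortality), a Brier score, and the qualitative band described above. It is an illustration only: the analyses in this study were run in R, the function names are hypothetical, the toy data are invented, and the handling of band boundaries (and of values between 0.9 and 1.0, which the text does not name) is an assumption.

```python
def c_statistic(scores, events):
    """Share of (event, non-event) pairs in which the patient with the
    event has the higher score; tied scores count as 0.5."""
    pos = [s for s, e in zip(scores, events) if e == 1]
    neg = [s for s, e in zip(scores, events) if e == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

def brier_score(probabilities, events):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - e) ** 2 for p, e in zip(probabilities, events)) / len(events)

def discrimination_band(c):
    """Map a C-statistic to the qualitative labels used in the text.
    Boundary handling is an assumption of this sketch."""
    if c >= 1.0:
        return "perfect"
    if c >= 0.9:
        return "above good (band not named in the text)"
    if c >= 0.8:
        return "good"
    if c >= 0.7:
        return "fair"
    if c >= 0.6:
        return "poor"
    if c > 0.5:
        return "very poor"
    return "no prediction"

# Invented toy data: six patients, higher score = higher predicted risk.
toy_scores = [1, 2, 3, 4, 5, 6]
toy_events = [0, 0, 1, 0, 1, 1]
toy_c = c_statistic(toy_scores, toy_events)  # 8 of 9 pairs concordant
toy_brier = brier_score([0.1, 0.9], [0, 1])
```

A lower Brier score indicates better agreement between predicted probabilities and observed outcomes, whereas a higher C-statistic indicates better ranking of patients by risk.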

Results

Patient characteristics and distribution of fitness assessments

We identified 439 consecutive adults with non-APL AML who received induction therapy at 5 academic institutions in France between January 1, 2017, and December 31, 2019. Of these, 11 were excluded because of prior allogeneic HCT (n = 4), previous treatment with non-intensive AML chemotherapy (n = 5), and other types of acute leukemia (n = 2), leaving 428 patients with a median age of 60 years (IQR, 49–66) for analysis. Demographics of this study cohort are summarized in Table 1. Performance status (PS) was 0 to 1 in most patients (92%). Ninety-one patients (21%) had secondary AML (prior solid tumor requiring chemotherapy and/or radiation therapy [n = 28] or prior hematologic malignancy [n = 63]). Most patients received 3 + 7 (n = 240, 56%) and ICL (n = 174, 41%) induction therapy whereas 14 (3%) received CPX-351. The distribution of each parameter used to calculate unfitness RCT eligibility criteria, Ferrara unfit criteria, augmented HCT-CI, and MDACC scoring system according to age categories is summarized in Table 2. At least one criterion defining unfitness in recent RCTs was found in 104 patients (24%) whereas 132 patients (31%) met at least one Ferrara criterion for unfitness. Only 20 patients (5%) were considered at very high risk according to the augmented HCT-CI scoring system or the AML-CM; only 7% of patients had a high TRM score (operationally defined here as a TRM score ≥ 13.1). The MDACC scoring system was low for most patients; 12% and 2% had a high or very high score, respectively.

Table 1 Patient characteristics.
Table 2 Distribution of parameters used to calculate scoring systems stratified by age categories.

Early mortality and survival estimates

We first examined the association of individual risk scores with measures of early mortality and estimates of survival. With a median follow-up of 527 (IQR: 276–870) days among survivors, there were 130 relapses and 139 deaths with 19 (4%) and 26 (6%) deaths occurring at 28 and 56 days, respectively. Several parameters and scores were associated with early mortality including white blood cell (WBC) count at diagnosis (odds ratio [OR] = 1.01 [1.00–1.01], P = 0.007, and OR = 1.01 [1.00–1.01], P = 0.006, for 28-day and 56-day mortality, respectively), augmented HCT-CI (OR = 1.32 [1.08–1.61], P = 0.005, and OR = 1.36 [1.14–1.63], P < 0.001, respectively), AML-CM (OR = 1.34 [1.13–1.58], P < 0.001, and OR = 1.39 [1.19–1.62], P < 0.001, respectively), Ferrara criteria (OR = 2.61 [1.03–6.73], P = 0.042, and OR = 3.32 [1.49–7.63], P = 0.004, respectively), TRM score (OR = 1.06 [1.02–1.10], P = 0.002, and OR = 1.07 [1.03–1.11], P < 0.001, respectively), and the MDACC scoring system (OR = 1.36 [1.12–1.64], P = 0.001, and OR = 1.39 [1.18–1.65], P < 0.001, respectively) (Table 3). Figure 1 shows 28-day and 56-day mortality in our cohort stratified by the various risk scores. A similar proportion of patients judged fit or unfit according to criteria used for eligibility in recent RCTs targeting less fit individuals experienced early mortality (28-day mortality of 4% vs. 7%, P = 0.27, and 56-day mortality of 5% vs. 9%, P = 0.21, respectively). In contrast, all the risk assessment scores used as categorical variables were associated with 28-day and 56-day mortality except for the TRM score, for which the difference was only statistically significant for 56-day mortality (Augmented HCT-CI: P = 0.019 and P = 0.017 for 28-day and 56-day mortality, respectively; AML-CM: P = 0.010 and P < 0.001; Ferrara criteria: P = 0.035 and P = 0.002; TRM score: P = 0.14 and P = 0.028; and MDACC: P = 0.019 and P = 0.001) (Supplementary Table 5).
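As a reading aid for the odds ratios above, the sketch below shows how an OR attached to a continuous score compounds over several points. It assumes that each reported OR refers to a one-point increase in the score and that the logistic model is linear on the log-odds scale; both are assumptions of this illustration rather than statements from the analysis. The function names and the 4% baseline probability used in the example are hypothetical.

```python
import math

def odds_ratio_for_increment(or_per_point, points):
    """OR for a `points`-unit increase under a log-linear logistic model."""
    return math.exp(points * math.log(or_per_point))

def probability_after_shift(baseline_probability, odds_ratio):
    """Convert a baseline probability to the probability implied by
    multiplying its odds by `odds_ratio`."""
    odds = baseline_probability / (1 - baseline_probability)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

# AML-CM, 28-day mortality: OR = 1.34 (taken here as per point).
# A 3-point difference corresponds to 1.34 ** 3, about 2.41.
or_3_points = odds_ratio_for_increment(1.34, 3)

# Hypothetical baseline risk of 4%, chosen only to show the conversion.
shifted_risk = probability_after_shift(0.04, or_3_points)
```

Because ORs act on odds rather than probabilities, the implied change in absolute risk depends on the baseline risk, which is why the conversion is kept as a separate step.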

Table 3 Univariable analysis of factors associated with 28-day and 56-day mortality, EFS, and OS.
Fig. 1: Distribution of predictive scoring systems by early (28-day and 56-day) mortality. The proportion of patients with early mortality is represented for individual patient categories, as stratified by scoring systems.
A RCT criteria for unfitness, B Augmented HCT-CI, C AML-CM, D Ferrara criteria, E TRM score, and F MDACC scoring system. AML acute myeloid leukemia, AML-CM AML composite model, HCT-CI hematopoietic cell transplantation comorbidity index, MDACC MD Anderson Cancer Center, RCT randomized clinical trial, TRM treatment related mortality.

The AML-CM (hazard ratio [HR] = 1.13 [1.07–1.18], P < 0.001, and HR = 1.15 [1.10–1.21], P < 0.001, for EFS and OS, respectively) and the recent mortality score from MDACC (HR = 1.17 [1.10–1.24], P < 0.001, and HR = 1.22 [1.14–1.30], P < 0.001, respectively) were associated with EFS and OS (Fig. 2). The HCT-CI (HR = 1.05 [0.99–1.12], P = 0.10, and HR = 1.07 [1.01–1.14], P = 0.03, respectively) and the TRM score (HR = 1.02 [1.01–1.04], P = 0.01, and HR = 1.03 [1.02–1.05], P < 0.001, respectively) were associated with EFS and OS if scores were used as continuous but not as categorical variables (Table 3). The other scores were not associated with EFS/OS. Similar results were obtained when patients who underwent allogeneic HCT were censored at the time of allografting (Supplementary Fig. 1).

Fig. 2: Outcome of patients stratified by each scoring system.

Kaplan-Meier plots of event-free and overall survival according to A, B RCT criteria for unfitness, C, D Augmented HCT-CI, E, F AML-CM, G, H Ferrara criteria, I, J TRM score, and K, L MDACC scoring system. AML acute myeloid leukemia, AML-CM AML composite model, HCT-CI hematopoietic cell transplantation comorbidity index, MDACC MD Anderson Cancer Center, RCT randomized clinical trial, TRM treatment related mortality.
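Survival curves of this kind rest on the Kaplan–Meier product-limit estimator. The following Python sketch shows the calculation on invented data; it is a simplified illustration (no confidence intervals, no group comparison), not the code used for the study figures, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up in days from the start of induction
    events : 1 if the event (e.g., death) occurred, 0 if censored
    Patients censored at a given time are treated as still at risk for
    events occurring at that same time (the usual convention).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Invented toy cohort of five patients (not study data).
toy_curve = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
# Steps: 1 of 5 at day 5, 1 of 4 at day 8, 1 of 2 at day 12.
```

Censored patients lower the number at risk without producing a step in the curve, which is how observations from patients last known alive contribute to the estimate.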

Predictive accuracy of individual risk scores

We next determined the accuracy with which individual risk scores predicted early mortality and survival. Discrimination for OS after induction therapy is represented in Fig. 3 with time-dependent AUC curves describing the predictive ability of each score over time. The best predictions of 28-day and 56-day mortality were observed with the AML-CM (C-statistic, 0.70 and 0.71, respectively) followed by the MDACC scoring system (C-statistic, 0.66 and 0.68, respectively), the augmented HCT-CI (C-statistic, 0.65 and 0.67, respectively), the TRM score (C-statistic, 0.64 and 0.66, respectively), and the Ferrara criteria (C-statistic, 0.61 and 0.64, respectively) (Table 3). The discriminative ability of the criteria used to evaluate fitness in RCTs enrolling less fit patients was very limited at all time points (maximum C-statistic of 0.57 at 28 days). The majority of the risk scores performed better for the prediction of 28-day and 56-day mortality than for the prediction of EFS and OS, but predictive ability remained similar for the two scores incorporating cytogenetic and molecular variables, namely the AML-CM and the MDACC scoring system (Fig. 3). The accuracy (Brier scores) of all risk scores was broadly similar, although slightly better for the AML-CM, the MDACC scoring system, and the TRM score.

Fig. 3: Time-dependent AUC and Brier score for prediction of mortality over time for different scoring systems.

AUCs (A) and Brier scores (B) for overall survival according to time are depicted.

Association between individual risk scores, early mortality, and long-term survival in patients older than 60 years

As the ability to predict early mortality may be particularly relevant in older patients, we performed subset analyses restricting our dataset to the 216 individuals ≥60 years of age. This sensitivity analysis was also driven by the fact that patients judged less fit in RCTs enrolling patients ineligible for intensive induction therapy had to be 60 years or older. These 216 patients were more likely than younger patients to have a PS of 1 rather than 0 (PS 0, 64% vs. 47%; PS 1, 28% vs. 44%, for younger vs. older patients, respectively; P = 0.003) and less likely to have favorable cytogenetic/molecular disease risk (37% vs. 23%, P = 0.001), with a corresponding increase in intermediate-risk disease (26% vs. 39%, P = 0.006), whereas the proportion of patients with adverse-risk disease was similar between younger and older individuals (37% vs. 38%; P = 0.88) (Supplementary Table 6). As all scoring systems include age as a covariate, risk scores were generally higher in older than in younger patients. Despite higher risk scores observed in older patients, early mortality was low, with 12 (6%) and 17 (8%) deaths occurring at 28 and 56 days, respectively. The AML-CM and the MDACC scoring systems were associated with 28-day mortality (OR = 1.26 [0.99–1.58], P = 0.049; and OR = 1.30 [0.99–1.66], P = 0.047, respectively) whereas the augmented HCT-CI, the AML-CM, the TRM score, and the MDACC scoring system were associated with 56-day mortality (OR = 1.42 [1.13–1.79], P = 0.002; OR = 1.42 [1.16–1.75], P < 0.001; OR = 1.07 [1.02–1.12], P = 0.003; and OR = 1.39 [1.10–1.74], P = 0.004, respectively) (Supplementary Table 7). As in the whole cohort, EFS and OS were only associated with the AML-CM (HR = 1.10 [1.02–1.19], P = 0.010; and HR = 1.17 [1.07–1.28], P < 0.001, respectively) and the MDACC scoring system (HR = 1.13 [1.02–1.24], P = 0.015; and HR = 1.16 [1.02–1.32], P = 0.024, respectively) (Supplementary Table 7, Supplementary Fig. 2).

Discussion

Several risk scores have been developed to assess the suitability of individual adults for intensive AML chemotherapy to help in the decision-making process. In our study, all risk scores identified a small subgroup of patients with increased early mortality, although the size of this subgroup varied from one score to another. However, overall, the prediction accuracy of all scoring systems was limited, with the best performance having a C-statistic close to 0.70 for early mortality. Only scoring systems including AML characteristics could help in predicting long-term survival. Criteria used to define unfitness in recent randomized clinical trials enrolling patients judged unsuitable for intensive chemotherapy did not perform well to stratify patients or to predict outcome. Despite the association with outcomes seen with several risk scores, the predictive accuracy overall was limited, highlighting important limitations of our ability to estimate fitness.

The use of previously developed scores identified a subset of “high-risk” patients with increased early mortality although early mortality rates in our cohort were low (28-day and 56-day mortality rates of 4% and 6%, respectively), while 28-day mortality varied from 5% to 11% in the cohorts used to develop or validate these risk scores [12, 16]. One may speculate that improvement in supportive care and the use of less toxic induction therapy regimens, such as CPX-351, although seldom used in our institutions, may explain this lower early mortality rate. It is important to point out that the proportion of patients identified as high risk for poor outcome varied greatly between scores, from 2% for the MDACC scoring system to 30% for the TRM score, which contrasts with the training and validation cohorts used to develop these scores, where the subset of high-risk patients was usually larger (22% for both the augmented HCT-CI and the AML-CM, 11 to 20% for the TRM score, and 9% for the MDACC scoring system) [12, 14, 16]. On the other hand, the Ferrara unfit criteria were met in 31% of patients, nearly identical to the rate observed in the study by Palmieri et al., but with an impact on early mortality and not on overall survival [12].

We examined the comparative predictive accuracy of the scoring systems, which was lower than in previous training and validation cohorts. Whereas the C-statistics for 56-day mortality were 0.72 and 0.78 in the validation cohorts for the augmented HCT-CI and the AML-CM, respectively, these values were lower in our cohort (0.67 and 0.71, respectively) [14]. This was also observed for the Ferrara criteria and the TRM score, which previously had C-statistics of 0.76 and 0.72 for 28-day mortality, respectively (0.61 and 0.64 in our cohort) [12]. As in previous studies, we observed that scoring systems including AML-specific characteristics, i.e., ELN-2017 classification for the AML-CM, secondary-type AML for the TRM score, and cytogenetics for the MDACC scoring system, were slightly more accurate in predicting OS [12]. We can hope that recent advancements in molecular characterization of AML by next-generation sequencing can further improve the predictive ability of these scoring systems [23, 24].

Assessing a patient’s suitability for intensive chemotherapy remains a challenge. Whereas most RCTs in adults with AML enrolled patients eligible for intensive chemotherapy in the last decade, 13 (17%) specifically enrolled patients judged unfit for intensive chemotherapy [25]. In earlier trials, this was determined by the local investigator’s own evaluation, but in more recent years, criteria defining ineligibility for intensive AML chemotherapy have been proposed. Although 24% of patients in our cohort would have been judged not suitable and thus eligible for participation in trials evaluating less intense induction therapy regimens, these patients had outcomes similar to those of patients judged fit. Despite the absence of formal validation, these criteria will probably be used more frequently in the future to compare therapies in medically less-fit patients. Our results suggest that it would be more appropriate to use scores that have been specifically developed to determine which patients have a greater risk for early treatment mortality and might therefore not be eligible for intensive chemotherapy [12, 14, 16].

Although these scores are not routinely used in our institutions, there is undeniable patient selection in our cohort of patients who received intensive chemotherapy. This might explain why the proportion of patients classified as high-risk by some scores was low. Since patients in our cohort were treated more recently than patients in the training cohorts used to develop the scoring systems, we cannot exclude that some high-risk patients may instead have received less-intense chemotherapy in our institutions. Similar selection processes are present in most institutions, perhaps not to the extent that is observed in our cohort, which may limit the generalizability of scoring systems. This further emphasizes the importance of external validation when developing such tools. The predictive ability of some scores may have been further limited in our cohort due to missing data used to calculate them, including uricemia levels at diagnosis and pulmonary function testing. Since all scoring systems, including the augmented HCT-CI, were evaluated retrospectively, some relevant comorbidities may have been underreported. As some criteria used to calculate these scores were only occasionally observed, it was not possible to assess how individual criteria contributed to the accuracy of these scoring systems.

Previously developed risk scores could identify a small subgroup of patients with increased early mortality after intensive AML therapy, but accurate individualized prediction was relatively limited. Criteria used to define suitability for intensive AML chemotherapy in recent RCTs targeting patients judged less fit for intensive chemotherapy did not perform well to stratify patients or to predict outcome.