As management of breast cancer advances, incorporating tumor biology in decision making is paramount. Tumor biology, response to neoadjuvant chemotherapy (NAC), and nodal status are several key factors that help decide the patient’s optimal therapy. Various clinical tools and nomograms are available online that incorporate specific tumor factors to help with treatment decisions.1,2,3,4,5,6 The use of NAC in operable breast cancer continues to increase, especially in triple-negative breast cancer (TNBC) and human epidermal growth factor receptor 2 (HER2)-positive (HER2+) disease.7 NAC allows for in vivo assessment of treatment response and can result in downstaging disease in the breast and axilla, and de-escalation of breast and axillary surgery.7,8,9,10

Several prospective studies have led to a change in the way surgeons manage the axilla in breast cancer patients treated with NAC. The SENTINA trial, SN FNAC trial, and the American College of Surgeons Oncology Group (ACOSOG) Z1071 trial showed that sentinel lymph node (SLN) surgery can be performed for node-positive women with a good response to NAC, which has led to an increase in the use of SLN surgery for these patients.11,12,13

When considering SLN surgery after NAC for a patient with cN+ disease at diagnosis, the patients most likely to benefit from this approach are those most likely to have a complete nodal response. Preoperative models that predict the likelihood of the patient achieving a pathologic complete response (pCR) in the axilla after NAC are helpful to guide this decision making. Multiple models have been published predicting axillary pCR after NAC in varying cohorts.2,3,4,5,6 Authors at the Mayo Clinic used the National Cancer Database (NCDB) from 2010 to 2012 to construct two models that showed good discrimination for predicting pathologic node-negative (pN0) status following NAC in patients who were clinically node-negative (cN0) and clinically node-positive (cN+) at presentation.9 The model for patients with cN+ disease had an area under the receiver operating characteristic curve (AUC) of 0.71 and was created using data from 9494 patients, of whom 3314 (34.9%) achieved a nodal pCR. In another study from MD Anderson Cancer Center (MDACC), 584 cT1–4, N1, M0 patients who underwent NAC from 2001 to 2013 were studied. Two predictive models (AUC 0.765 and 0.776) were created using their database. In their cohort, 37.2% of patients had nodal pCR.3

The goal of the current study was to compare three validated predictive models for nodal response in patients with cN+ disease treated with NAC to assess model generalizability to a different setting and to compare model performance and clinical utility, both overall and by biologic subtype, in our patient cohort.

Methods

With Institutional Review Board approval, patients with cT1–T4, cN1–N3 breast cancer who underwent surgical treatment with axillary staging after NAC at Mayo Clinic Rochester from 2008 to 2019 were reviewed. All cN+ patients included were pathologically confirmed by percutaneous biopsy. Patients with distant metastasis or prior history of breast cancer were excluded. Data regarding patient demographics, tumor biology, and surgical procedure were obtained from the breast surgery registry, and imaging studies were reviewed for number of abnormal nodes on axillary ultrasound. Patients were classified as nodal pCR (ypN0) if they had negative nodes or isolated tumor cells, and as ypN+ if they had micrometastasis or larger (> 0.2 mm) at surgery after NAC. The predicted probability of ypN0 status was calculated using logistic regression model coefficients from the three published models; these coefficients did not appear in the original MDACC publication but were provided by the study authors upon request. The Mayo predictive model included age, clinical tumor category (cT), clinical nodal status (cN), grade, histology, and tumor biologic subtype (estrogen receptor-positive [ER+]/HER2-negative [HER2−], ER+/HER2+, ER-negative [ER−]/HER2+, and ER−/HER2−). The MDACC has two predictive models. The MDACC 1 predictive model included clinical tumor category (cT), multifocal/multicentric disease, number of pretreatment sonographically suspicious nodes (< 4 vs. ≥ 4), nuclear grade, HER2±, ER± and progesterone receptor (PR)±. The MDACC 2 predictive model was similar to MDACC 1 except it did not include the number of suspicious nodes as a variable, but did include histology (lobular vs. mixed or ductal).
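
As an illustration of how a predicted probability of ypN0 status is obtained from logistic regression coefficients, the following minimal Python sketch applies the inverse logit to a linear predictor. The function name, variable names, intercept, and coefficient values are hypothetical and chosen only for illustration; they are not the published Mayo or MDACC coefficients, and the actual models include the covariates listed above.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Return the predicted probability of ypN0 from a logistic model.

    The linear predictor is intercept + sum(coefficient * feature value),
    and the probability is its inverse logit.
    """
    linear = intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical values for illustration only (NOT the published coefficients).
example_intercept = -0.6
example_coefficients = {"her2_positive": 1.2, "er_positive": -0.9, "grade_3": 0.5}

# Hypothetical patient: HER2+, ER-, grade 3.
patient = {"her2_positive": 1, "er_positive": 0, "grade_3": 1}
p_ypn0 = predicted_probability(example_intercept, example_coefficients, patient)
print(f"Predicted probability of ypN0: {p_ypn0:.3f}")
```

With real coefficients, the same calculation would be repeated for each patient and each of the three models.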

Statistical Analysis

To assess discrimination, the AUC was estimated and reported with 95% confidence intervals for each model. AUCs were compared across models using the DeLong method. Calibration was assessed by comparing the mean predicted probability from each model to the observed probability of ypN0 status using a one-sample t test and via loess calibration curves showing the relationship between the observed probability and the predicted probability on a continuous scale. To assess utility for clinical decision making, we classified each patient as having ≥ 50% or < 50% probability of ypN0 status by each model. This classification was then compared with actual pathologic node status to estimate specificity (i.e., the probability that a ypN0 patient was predicted as having ≥ 50% probability of node-negative status) and sensitivity (i.e., the probability that a ypN+ patient was predicted as having < 50% probability of node-negative status). A cut-point of 50% reflects a common rationale for reducing a continuous probability to a binary clinical choice: each patient is assigned to whichever of the two outcomes is more likely. In addition to the cut-point of 50%, we also identified the optimal cut-point for each model as the value that minimized the absolute difference between sensitivity and specificity, and reported performance metrics at that cut-point as well. Sensitivity and specificity values were compared between models using McNemar’s test for paired proportions. P-values < 0.05 were considered statistically significant. Analysis was performed using SAS version 9.4 (SAS Institute Inc., Cary, NC, USA) and R version 3.6.1 (The R Foundation for Statistical Computing, Vienna, Austria).
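
The classification step and the paired comparison described above can be sketched in Python as follows. This is a minimal illustration rather than the study code (the analysis itself was run in SAS and R). It follows the convention used here, in which sensitivity is computed among ypN+ patients and specificity among ypN0 patients. The function names and toy data are hypothetical, and the exact (binomial) form of McNemar's test is an assumption, since the variant used in the analysis is not stated.

```python
from math import comb


def sensitivity_specificity(pred_prob_ypn0, is_ypn0, cut_point=0.5):
    """Sensitivity: share of ypN+ patients with predicted P(ypN0) < cut-point.
    Specificity: share of ypN0 patients with predicted P(ypN0) >= cut-point."""
    ypn_pos = [p for p, y in zip(pred_prob_ypn0, is_ypn0) if not y]
    ypn_zero = [p for p, y in zip(pred_prob_ypn0, is_ypn0) if y]
    sensitivity = sum(p < cut_point for p in ypn_pos) / len(ypn_pos)
    specificity = sum(p >= cut_point for p in ypn_zero) / len(ypn_zero)
    return sensitivity, specificity


def mcnemar_exact(calls_a, calls_b):
    """Two-sided exact McNemar p-value for paired binary calls from two
    models on the same patients; only discordant pairs contribute."""
    b = sum(1 for x, y in zip(calls_a, calls_b) if x and not y)
    c = sum(1 for x, y in zip(calls_a, calls_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical): predicted P(ypN0) and true nodal status.
toy_probs = [0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.55, 0.1]
toy_ypn0 = [True, True, True, False, False, False, False, False]
sens, spec = sensitivity_specificity(toy_probs, toy_ypn0)

# Toy paired calls (1 = classified correctly) from two hypothetical models.
calls_a = [1, 1, 1, 1, 0, 0]
calls_b = [0, 0, 0, 0, 0, 1]
p_value = mcnemar_exact(calls_a, calls_b)
print(sens, round(spec, 3), p_value)
```

To compare specificity between two models, the paired calls would be restricted to ypN0 patients; to compare sensitivity, they would be restricted to ypN+ patients.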

Results

Overall, 581 patients with cT1–4, N1–3 breast cancer treated with NAC and meeting the study inclusion criteria were identified. Median patient age was 50 years (range 24–86). Clinical tumor category distribution was 13.3% cT1, 49.7% cT2, 32.4% cT3, and 4.6% cT4. The majority of patients (519 [89.3%]) had cN1 disease, and a minority (208 [35.8%]) had multicentric or multifocal disease. Of the 581 patients, 156 (26.9%) had SLN only, 146 (25.1%) had SLN followed by completion axillary lymph node dissection (ALND), and 279 (48.0%) had ALND only. The median number of nodes removed in cases with SLN only was four. The nodal pCR (ypN0) rate was 43.5%, with 188 (32.4%) patients with ypN1 disease, 90 (15.5%) with ypN2 disease, and 50 (8.6%) with ypN3 disease (Table 1).

Table 1 Demographic and clinical characteristics of the cohort including 581 patients treated with NAC for clinically node-positive breast cancer, overall and by pathologic node status

Comparing patients who achieved a nodal pCR (ypN0) and those with residual nodal disease (ypN+), there was no difference in clinical N category at presentation, or the proportion of patients with fewer than four abnormal lymph nodes on ultrasound (Table 1). Nodal pCR was higher with smaller tumors at presentation (cT1 and cT2). Higher grade tumors (more poorly differentiated) were more likely to achieve nodal pCR. Younger patients (median age 49 vs. 52 years; p = 0.006) had higher nodal pCR rates. Biological subtypes played a significant role in the nodal pCR rate, with HER2+ disease having the highest rate of nodal pCR (68.9%) and ER+/HER2− having the lowest rate (23.9%; p < 0.001) (Table 1).

Overall, the three predictive nomograms had very similar AUC values, ranging from 0.761 to 0.769 (p = 0.80), with good discrimination. Given the importance of tumor biologic subtype, we further assessed model performance separately by biologic subtype and found that the AUCs within the subtypes were similar across the three models (Table 2 and Fig. 1). Not surprisingly, the within-subtype AUC values were generally lower than for the overall cohort, ranging from 0.67 to 0.70 in ER+/HER2−, 0.64–0.66 in HER2+, and 0.66–0.68 in ER−/HER2−.
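
To illustrate why pooled AUC values can exceed within-subtype values when subtype itself drives the prediction, the following minimal Python sketch computes the AUC as the proportion of ypN0/ypN+ patient pairs in which the ypN0 patient has the higher predicted probability (ties counted as one half). The function name and toy data are hypothetical and are not study data.

```python
def auc(pred_prob_ypn0, is_ypn0):
    """AUC as the probability that a randomly chosen ypN0 patient has a
    higher predicted P(ypN0) than a randomly chosen ypN+ patient."""
    events = [p for p, y in zip(pred_prob_ypn0, is_ypn0) if y]
    non_events = [p for p, y in zip(pred_prob_ypn0, is_ypn0) if not y]
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


# Toy data (hypothetical): one subtype with high predicted probabilities
# and one with low predicted probabilities.
high_probs, high_ypn0 = [0.6, 0.7, 0.8, 0.65], [True, True, True, False]
low_probs, low_ypn0 = [0.2, 0.1, 0.3, 0.15], [True, False, False, False]

auc_high = auc(high_probs, high_ypn0)
auc_low = auc(low_probs, low_ypn0)
auc_pooled = auc(high_probs + low_probs, high_ypn0 + low_ypn0)
print(round(auc_high, 3), round(auc_low, 3), round(auc_pooled, 3))
```

In this toy example the pooled AUC is higher than either within-subtype AUC because many of the correctly ordered pairs simply separate one subtype from the other.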

Table 2 Performance of three predictive models, overall and separately for each biologic subtype
Fig. 1 Receiver operating characteristic curves for the prediction of ypN0 status for three clinical prediction models, assessed for the entire cohort of 581 patients treated with NAC, and also separately by biologic subtype. NAC neoadjuvant chemotherapy, MDACC MD Anderson Cancer Center, ER+ estrogen receptor-positive, ER− estrogen receptor-negative, HER2 human epidermal growth factor receptor 2, HER2− HER2-negative, HER2+ HER2-positive, TNBC triple-negative breast cancer

To assess model calibration, the mean predicted probability from each model was compared with the observed probability (Table 3). The predicted probability from the Mayo model (mean 34.9%) was significantly lower (p < 0.001) than the observed probability of ypN0 status in the cohort (43.5%), while MDACC 1- and 2-predicted probabilities (mean 42.3% and 41.7%, respectively) were similar to the observed probability. This poorer calibration for the Mayo model is also apparent in calibration curves (Fig. 2a), where the Mayo model estimated a lower probability of ypN0 status than the observed probability over much of the range, whereas the MDACC models showed more consistent concordance between predicted and observed probability. When examined by biologic subtype, HER2+ patients had the highest predicted probabilities, followed by the ER−/HER2− patients (Table 3, Fig. 2b–d), while the predicted probabilities of ypN0 status were markedly lower for ER+/HER2− patients in all models.
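
The comparison of mean predicted probability with observed probability can be sketched as a one-sample t statistic. The following Python example is a minimal illustration under stated assumptions: the function name and toy predictions are hypothetical, and the p-value (which would come from a t distribution with n − 1 degrees of freedom) is omitted to keep the sketch within the standard library. The final lines only restate the gaps implied by the means reported above.

```python
from math import sqrt
from statistics import mean, stdev


def calibration_in_the_large(predicted_probs, observed_rate):
    """Return (mean predicted - observed, one-sample t statistic)."""
    n = len(predicted_probs)
    diff = mean(predicted_probs) - observed_rate
    t_stat = diff / (stdev(predicted_probs) / sqrt(n))
    return diff, t_stat


# Toy predictions (hypothetical) compared with a toy observed rate of 50%.
toy_predictions = [0.2, 0.3, 0.4, 0.5, 0.6]
diff, t_stat = calibration_in_the_large(toy_predictions, observed_rate=0.5)
print(round(diff, 3), round(t_stat, 3))

# Gaps implied by the reported means (percentage points, predicted - observed).
observed_pct = 43.5
reported_means = {"Mayo": 34.9, "MDACC 1": 42.3, "MDACC 2": 41.7}
gaps = {model: round(m - observed_pct, 1) for model, m in reported_means.items()}
print(gaps)
```

A negative gap indicates that the model, on average, underestimates the probability of ypN0 status in this cohort.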

Table 3 Comparison of the mean model-predicted probability of ypN0 status with the observed percentage of ypN0 to assess model calibration, overall and by biologic subtype
Fig. 2 a Loess calibration curves showing the relationship between the model predicted probability of ypN0 status and the observed probability for the cohort of 581 patients treated with NAC. b–d Side-by-side boxplots comparing the predicted probability distribution between ypN0 and ypN+ patients separately, by subtype for the b Mayo model, c MDACC 1 model, and d MDACC 2 model. NAC neoadjuvant chemotherapy, MDACC MD Anderson Cancer Center, ER+ estrogen receptor-positive, ER− estrogen receptor-negative, HER2 human epidermal growth factor receptor 2, HER2− HER2-negative, HER2+ HER2-positive, TNBC triple-negative breast cancer

Considering specificity, at a cut-point of ≥ 50% predicted probability of ypN0 status, MDACC 1 and 2 each identified 157/253 (62.1%) ypN0 patients as having ≥ 50% probability of negative nodes, whereas the Mayo model identified only 103/253 (40.7%; p < 0.001). By subtype, MDACC 1 predicted probability ≥ 50% identified 107/131 (81.7%) HER2+ patients and 45/58 (77.6%) ER−/HER2− patients with ypN0 status. MDACC 2 performed similarly, while the Mayo model identified only 78/131 (59.5%) and 25/58 (43.1%) of patients with nodal pCR, respectively (p < 0.001). For the ER+/HER2− patients with ypN0 status, none of the three models identified them well at the ≥ 50% threshold (specificity 0–9.4%). Considering sensitivity, at the threshold of < 50% predicted probability of ypN0 status, the Mayo model identified a higher percentage of ypN+ patients (294/328, 89.6%) compared with the MDACC 1 and 2 models (251/328, 76.5%; p < 0.001) (Table 2).
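
The percentages in the preceding paragraph follow directly from the reported counts. As a worked check, the short Python sketch below recomputes each proportion. The dictionary labels and helper name are ours; the counts are those given in the text.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# (numerator, denominator) pairs as reported at the 50% cut-point.
reported_counts = {
    "MDACC 1 and 2 specificity, overall": (157, 253),
    "Mayo specificity, overall": (103, 253),
    "MDACC 1 specificity, HER2+": (107, 131),
    "MDACC 1 specificity, ER-/HER2-": (45, 58),
    "Mayo specificity, HER2+": (78, 131),
    "Mayo specificity, ER-/HER2-": (25, 58),
    "Mayo sensitivity, overall": (294, 328),
    "MDACC 1 and 2 sensitivity, overall": (251, 328),
}

percentages = {label: pct(num, den) for label, (num, den) in reported_counts.items()}
for label, value in percentages.items():
    print(f"{label}: {value}%")
```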

Using optimal cut-points that balance sensitivity and specificity (i.e., minimize their absolute difference) instead of the arbitrary 50% threshold, the optimal thresholds varied by biologic subtype (Table 3) and were highest in HER2+, followed by ER−/HER2−, and substantially lower (21–22%) in ER+/HER2−. The reason for this becomes clear when examining the distribution of predicted probabilities by subtype (Fig. 2b–d) split by ypN status. The distribution of predicted probabilities for a given subtype is strongly related to the prevalence of ypN0 status in that subtype; thus, almost no ER+/HER2− patients have ≥ 50% probability of nodal pCR when considered among the cohort as a whole, making this cut-point not useful for discriminating between ypN0 and ypN+ patients in this subtype. However, applying a lower threshold, such as ≥ 22% predicted probability, MDACC 1 achieved useful results for identifying patients with a higher likelihood of ypN0 status within the ER+/HER2− subtype, with sensitivity and specificity of 65.2% and 65.6%, respectively (Table 3).
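
The search for a cut-point that minimizes the absolute difference between sensitivity and specificity can be sketched as follows. This Python example is a minimal illustration, not the study code: the function names are hypothetical, candidate cut-points are taken to be the observed predicted probabilities (an assumption), and the toy data merely mimic a subtype in which all predicted probabilities fall well below 50%.

```python
def sens_spec(pred_prob_ypn0, is_ypn0, cut_point):
    """Sensitivity among ypN+ (predicted < cut-point) and specificity
    among ypN0 (predicted >= cut-point)."""
    ypn_pos = [p for p, y in zip(pred_prob_ypn0, is_ypn0) if not y]
    ypn_zero = [p for p, y in zip(pred_prob_ypn0, is_ypn0) if y]
    sens = sum(p < cut_point for p in ypn_pos) / len(ypn_pos)
    spec = sum(p >= cut_point for p in ypn_zero) / len(ypn_zero)
    return sens, spec


def optimal_cut_point(pred_prob_ypn0, is_ypn0):
    """Return (cut-point, sensitivity, specificity) minimizing
    |sensitivity - specificity| over the observed predicted values."""
    best = None
    for cut in sorted(set(pred_prob_ypn0)):
        sens, spec = sens_spec(pred_prob_ypn0, is_ypn0, cut)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, cut, sens, spec)
    return best[1], best[2], best[3]


# Toy data (hypothetical): a low-prevalence subtype with low predictions.
toy_probs = [0.10, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.35]
toy_ypn0 = [False, False, False, True, False, True, False, True]

sens_50, spec_50 = sens_spec(toy_probs, toy_ypn0, 0.5)
best_cut, best_sens, best_spec = optimal_cut_point(toy_probs, toy_ypn0)
print(sens_50, spec_50)
print(best_cut, best_sens, round(best_spec, 3))
```

In the toy data, the 50% cut-point classifies every patient as likely ypN+ (specificity of zero), whereas the balanced cut-point recovers comparable sensitivity and specificity, which is the same pattern described for the ER+/HER2− subtype.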

Discussion

Predictive models that estimate the likelihood of converting from cN+ prior to NAC to ypN0 after NAC may be useful for clinical decision making. In this study, we compared the performance of three previously validated predictive models using readily available clinical variables. All three models performed well overall with respect to discrimination. Calibration was superior for the MDACC models; however, further differences were identified when assessing performance at the cut-point of 50% predicted probability. At this cut-point, the Mayo model had better sensitivity for correctly classifying patients with ypN+, while the MDACC models generally had better specificity for identifying patients with ypN0; however, all models had poor specificity (< 10%) in the ER+/HER2− subset at this threshold.

The MDACC models performed better than the Mayo model with respect to calibration, likely because the prevalence of ypN0 status in the MDACC model development sample (37.2%) was closer to the prevalence in the sample assessed here (43.5%), whereas the NCDB sample from which the Mayo model was built had a lower prevalence of nodal pCR (34.9%). Correspondingly, the predicted probabilities of ypN0 status were generally higher in the MDACC 1 and 2 models than the Mayo model. This emphasizes the importance of validating the generalizability and transferability of a predictive model developed in one setting to another setting.14 The MDACC model may have performed better in the Mayo patient cohort because of the similarity of their clinical settings, since both are academic and tertiary referral centers. However, the Mayo model was developed in the NCDB, which includes data from more than 1500 Commission on Cancer facilities in diverse community and academic settings nationwide; thus, it is possible that the Mayo model is more relevant for the general population and that its calibration would be better in other settings.

For the MDACC 1 and 2 models, a key difference between the two models was the inclusion of the number of suspicious nodes on baseline axillary ultrasound. We found very similar results for these two models and little benefit from a statistical perspective to adding the ultrasound findings. This is important, especially in some institutions where ultrasound data are not readily available or documented precisely, and a clinical prediction model that does not require this information could be useful.

The percentage of patients with ypN0 status varied markedly by subtype as expected.11,15,16,17,18,19 Likely due to availability of targeted therapy, HER2+ patients had the highest response, with 68.9% ypN0. ER−/HER2− and ER+/HER2− patients had ypN0 rates of 47.2% and 23.9%, respectively. These data were mirrored in the predicted probability of ypN0 status for each subtype. We saw that HER2+ had the highest predicted probabilities, followed by ER−/HER2− and, lastly, ER+/HER2−. This is important because it explains why a probability cut-point of 50% uniformly applied did not perform equally well in all subtypes and identified very few (< 10%) of the ypN0 patients in the ER+/HER2− subtype who might be candidates for SLN surgery. A lower probability of nodal complete response is inherent in the ER+/HER2− subtype, making 50% a poor threshold for decision making within this subtype since ≤ 5% of ER+/HER2− patients had ≥ 50% predicted probability of ypN0 status by any model. Thus, using optimal, subtype-specific cut-points provided discrimination between ypN0 and ypN+ patients in ER+/HER2− patients similar to that found in other subtypes. Because tumor biology was such a strong discriminatory feature, the AUCs for the models applied to the entire cohort were also higher than the within-subtype AUCs, which were generally only fair and < 0.70. This represents a possible opportunity to improve performance with the development of subtype-specific prediction models.

The Mayo model performed better when assigning a probability < 50% of ypN0 status to ypN+ patients with sensitivity of 89.6% compared with both MDACC models at 76.5%; higher sensitivity means the model would appropriately select a higher percentage of ypN+ patients for ALND. The MDACC 1 and 2 models are more specific than the Mayo model and were thus more likely to identify ypN0 patients as having ≥ 50% predicted probability of nodal pCR, resulting in an accurate choice to perform SLN surgery and spare the patient ALND. Considering the cost–benefit ratio in this situation, where the choice is to perform SLN surgery or not, one could argue a model with higher specificity that correctly identifies more of the ypN0 patients (to consider SLN surgery) is superior to a model that identifies fewer of the ypN0 patients. Alternatively, choosing the more sensitive model better identifies patients who are likely to remain ypN+, and therefore could be considered for ALND to limit operating time and mitigate the false negative rate.

The clinical value of these predictive models lies in counseling patients about the likelihood of complete nodal response following NAC. This gives patients and physicians the ability to have a frank discussion in the preoperative setting regarding their specific surgical management. Patients with a high likelihood of complete nodal response can be considered for less extensive axillary surgery, such as SLN surgery with or without targeting the clipped node, or the MARI procedure (marking the axillary lymph node with radioactive iodine seeds).20

Limitations of this study include the use of prospectively collected data evaluated in a retrospective fashion. Several other prediction models have also been published to predict the likelihood of node-negative disease after NAC for patients with node-positive disease.1,2,3,4,5,6 Our study was limited to those models using clinical variables standardly available in the clinical record, and did not compare all available models in the literature. All three models that we evaluated have unique features that may be clinically obtained and universally applicable to patients in all types of medical centers. We feel that the benefit of increased clinical utility outweighs the limitations when using predictive nomograms as it relates to axillary management after NAC.

Conclusion

The three models to predict nodal response to NAC all performed well overall with respect to discrimination, demonstrating their clinical utility to select patients likely to benefit from SLN surgery, especially patients with ER−/HER2− or HER2+ disease. At a 50% threshold, MDACC models 1 and 2 had better specificity than the Mayo model; however, the Mayo model had better sensitivity. None of the models performed well for ER+/HER2− patients at the 50% threshold.