Background

Breast cancer, the leading cause of cancer death among women (http://gco.iarc.fr/today/online), is a heterogeneous disease that comprises different molecular subtypes characterized by diverse histological characteristics, aggressiveness, therapeutic strategies, and prognostic implications [1]. Growing evidence supports the application of personalized therapies to cancer patients. However, reliable stratification of patients according to predicted therapeutic success remains a major unresolved issue, including in breast cancer.

Neoadjuvant chemotherapy (NAC) has become a standard approach for patients with locally advanced breast neoplasms [2, 3]. The backbone of NAC is the combination of anthracycline and taxane, with the addition of trastuzumab in cases of human epidermal growth factor receptor 2-positive (HER2+) breast cancer [3]. NAC aims to expand the surgical indications and facilitate breast-conserving surgery. Furthermore, pathological complete response (pCR) to NAC, defined as the absence of residual invasive cancer in the breast and axillary nodes on pathological examination of the post-treatment surgical excision specimens (ypT0/TisN0) [4], has been shown to be a surrogate endpoint for long-term event-free and overall survival in breast cancer [5, 6]. The strongest association between pCR and long-term outcome has been reported in the most aggressive subtypes (triple negative, luminal high-grade, and HER2+ hormone-receptor-negative) [6]. Conversely, patients with pathological residual disease (non-pCR) after NAC are at higher risk of relapse. However, NAC may be limited by a number of toxic side effects (e.g., hematological toxicity); consequently, only 73–87% of regimens are completed in the preoperative setting, and impairments affecting daily life may persist even 3 years after treatment [7, 8]. Therefore, a staging procedure capable of predicting response to NAC could spare selected patients unnecessary toxicity.

The role of 18F-fluorodeoxyglucose (18F-FDG) positron emission tomography/computed tomography (PET/CT) in breast cancer staging is routinely limited to advanced disease [3, 9]. PET/CT allows imaging of both intratumor heterogeneity and treatment-induced changes. Standard semiquantitative imaging parameters obtained from 18F-FDG PET/CT have been shown to correlate with tumor aggressiveness in breast cancer, and are therefore also likely to correlate with patient outcome [10–16]. Radiomics, an approach that quantifies lesion heterogeneity on medical images, is an emerging translational research topic in breast cancer [17]. Recent literature indicates that features obtained from 18F-FDG PET/CT correlate with tumor histological characteristics and molecular subtypes [18–22], but evidence on their role as predictive parameters is still limited [21–25].

The main objective of this study was to evaluate the role of advanced imaging features obtained from baseline 18F-FDG PET/CT in combination with clinical and pathological parameters in the prediction of tumor response to NAC in breast cancer patients.

Materials and methods

Study population

Data for patients with a diagnosis of locally advanced breast cancer (n = 291), treated by NAC and surgery at Humanitas Clinical and Research Center between January 2010 and January 2018, were retrospectively collected. Patient inclusion criteria were as follows: (i) histological diagnosis of primary breast cancer, (ii) performance of 18F-FDG PET/CT for staging purposes before any treatment, and (iii) NAC as primary treatment followed by surgery. Ninety-two patients were selected. Of these, patients who had already been treated for breast cancer (n = 2) and those with suspected distant metastasis on the staging 18F-FDG PET/CT (n = 11) were excluded. The remaining 79 patients, who fulfilled the inclusion and exclusion criteria, were included in the present analysis. The Institutional Ethics Committee reviewed and approved this retrospective study. Specific informed consent was not required according to local ethics committee rules for retrospective studies.

NAC treatment

A standard NAC protocol with anthracycline and taxane, plus trastuzumab in patients with HER2 amplification [3], was administered in 68/79 patients. In the remaining 11 patients, a different therapeutic regimen was administered in accordance with the oncologist’s indication (detailed in Table 1 of the electronic supplementary material). Surgery was performed after completion of NAC in 77/79 patients. In two patients, the standard NAC protocol was discontinued and surgery was performed without completion of NAC (because of lack of response to systemic treatment in one patient and severe infection in the other).

Data collection

Data on programmed clinical examination, tumor marker measurement (cancer antigen 15.3), and radiological follow-up were collected from the medical records (n = 76). Three patients were referred to other hospitals for further oncological management; therefore, their radiological follow-up was not available. Data on survival status were available for all patients.

Histological characteristics of the primary tumor (expression of hormone receptors, HER2 status, Ki67 value, tumor dimensions and grade, histological tumor type, molecular subtype) and clinical characteristics of each patient (age at diagnosis, menopausal state, evidence of disease relapse or death, type of chemotherapy) were retrieved from the institutional database.

Image acquisition

The imaging acquisition protocol has been previously described [21]. Details according to the Image Biomarker Standardisation Initiative (IBSI) reporting guidelines [26] are provided in Table 2 of the electronic supplementary material. Briefly, 18F-FDG PET/CT images were acquired 60 ± 5 min after 18F-FDG injection in accordance with the European Association of Nuclear Medicine (EANM) guidelines, version 1.0 and, from February 2015, version 2.0 [27] using an integrated PET/CT scanner, either a Siemens Biograph 6 LSO (Siemens, Erlangen, Germany) or a General Electric Discovery 690 (General Electric Healthcare, Waukesha, WI, USA). CT images without contrast enhancement or respiratory gating (free-breathing images) were acquired. All PET images were corrected for attenuation using the acquired CT data.

Image analysis

18F-FDG PET/CT images were retrieved from the Picture Archiving and Communication System (PACS). A board-certified nuclear medicine physician with more than 10 years’ experience in PET/CT segmented breast tumor lesions on 18F-FDG PET/CT images using a semiautomatic approach. The volume of interest (VOI) of the breast lesion was defined on PET images with a threshold of 40% of the maximum standardized uptake value (SUVmax) using commercial software (PET VCAR; GE Healthcare). Only the primary tumor was segmented. Textural features (i.e., first-, second-, and higher-order imaging parameters) were extracted using dedicated software for radiomics (LIFEx) (https://www.lifexsoft.org/index.php). LIFEx, for technical reasons, extracted second- and higher-order imaging parameters only for lesions greater than 64 voxels.
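The fixed-threshold segmentation described above can be illustrated with a minimal sketch. It assumes the candidate voxels have already been restricted to a region around the lesion, that voxels at or above 40% of SUVmax are kept, and that TLG is computed as mean SUV times metabolic volume (the usual definition). The SUV values, the voxel volume, and the function names are hypothetical; this is not the PET VCAR or LIFEx implementation.

```python
def segment_voi(suv_values, fraction=0.40):
    """Keep voxels whose SUV is at least `fraction` of the maximum SUV."""
    cutoff = fraction * max(suv_values)
    return [v for v in suv_values if v >= cutoff]


def summarize_voi(voi, voxel_volume_ml, min_voxels=64):
    """First-order summary of a segmented lesion."""
    suv_mean = sum(voi) / len(voi)
    mtv_ml = len(voi) * voxel_volume_ml          # metabolic tumor volume
    return {
        "n_voxels": len(voi),
        "suv_max": max(voi),
        "suv_mean": suv_mean,
        "mtv_ml": mtv_ml,
        "tlg": suv_mean * mtv_ml,                # total lesion glycolysis
        # Texture features were extracted only for lesions > 64 voxels.
        "texture_eligible": len(voi) > min_voxels,
    }


# Hypothetical SUVs of the voxels inside a box drawn around one lesion.
candidate_voxels = [1.0, 2.5, 3.9, 4.5, 6.0, 8.0, 9.5, 10.0]
voi = segment_voi(candidate_voxels)
summary = summarize_voi(voi, voxel_volume_ml=0.1)
print(summary)
```

In this toy lesion only five voxels pass the threshold, so it would contribute first-order parameters but no second- or higher-order features.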

Primary endpoint

The primary endpoint of this study was the prediction of pCR to NAC, defined as the absence of remaining tumor on histopathological examination of the post-treatment surgical excision specimens [4].

Statistical analysis

Clinical data, biological characteristics of the tumor, and imaging-derived features were tested as possible predictors of pCR to NAC. Two different approaches (complete case and sensitivity) were used for model building.

The complete-case approach was applied for the main analysis. Accordingly, only subjects with complete variable information were included in the main analysis. Two models were built within the complete-case approach to predict pCR to NAC. In the first model, clinical data, biological tumor characteristics, and first-order imaging features were included. Univariable logistic regression was used to estimate odds ratios (ORs) and 95% confidence intervals (CIs) for the association of each variable and pCR to NAC. Variables satisfying an a-priori set criterion of p < 0.2 were included in the multivariable model. SUVmax and total lesion glycolysis (TLG), identified in the literature as potential predictors of outcome [28, 29], were also included in the model (hereafter referred to as model 1). We further tested whether any clinically meaningful interactions would improve the model fit.
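As an illustration of the univariable screening step, the sketch below computes an odds ratio, its 95% Wald confidence interval, and a two-sided p-value for a single binary predictor. For one binary predictor, the logistic regression estimate of the OR equals the cross-product ratio of the 2×2 table, so no iterative fitting is needed. The counts are invented for the example and are not study data; continuous predictors would require a fitted model instead.

```python
import math


def odds_ratio_wald(a, b, c, d, z_crit=1.959964):
    """OR, 95% Wald CI and two-sided p-value from a 2x2 table.

    a: predictor present, event      b: predictor present, no event
    c: predictor absent, event       d: predictor absent, no event
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_or / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    lower = math.exp(log_or - z_crit * se)
    upper = math.exp(log_or + z_crit * se)
    return math.exp(log_or), lower, upper, p_value


# Hypothetical counts (not taken from the study).
or_hat, lower, upper, p_value = odds_ratio_wald(a=12, b=18, c=8, d=32)
enters_multivariable_model = p_value < 0.2       # a-priori screening criterion
print(round(or_hat, 2), round(lower, 2), round(upper, 2), round(p_value, 3))
```

With these counts the confidence interval includes 1, yet the variable still passes the p < 0.2 screen, which shows how permissive that threshold is by design.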

In the second model, second- and higher-order imaging features were also evaluated as possible predictors of pCR to NAC. Features identified at univariable logistic regression as additional predictors of the outcome, along with the variables identified in the first model, were included in the model (hereafter referred to as model 2).

Model performance was evaluated through discrimination and calibration. The discriminatory power of the two models was assessed using the area under the receiver operating characteristic (ROC) curve (AUC). The AUC is the probability that the model assigns a higher predicted probability of the outcome to a randomly chosen patient with the outcome than to a randomly chosen patient without it [30]. Since the dataset used for model development was also used to assess model discrimination, the actual AUC estimates could be over-optimistic [31]. Thus, internal validation with 100 iterations of tenfold cross-validation was used for AUC evaluation. Calibration of the two models was evaluated by plotting predicted against observed probabilities of the outcome [31].
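The two ideas in this paragraph, AUC as a concordance probability and repeated tenfold cross-validation as a guard against optimism, can be sketched as follows. The sketch is illustrative only: the "model" is a stand-in that predicts the event rate of a patient's category in the training folds (not a logistic regression), the folds are not stratified, the out-of-fold predictions are pooled within each repetition, and the data are invented. None of these choices is stated in the paper.

```python
import random


def auc(scores, labels):
    """P(score of a random positive > score of a random negative); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_rates(groups, labels):
    """Toy model: predicted risk = event rate of the group in the training data."""
    overall = sum(labels) / len(labels)
    counts = {}
    for g, y in zip(groups, labels):
        n, k = counts.get(g, (0, 0))
        counts[g] = (n + 1, k + y)
    rates = {g: k / n for g, (n, k) in counts.items()}
    return lambda g: rates.get(g, overall)


def repeated_cv_auc(groups, labels, k=10, repeats=100, seed=0):
    """Mean AUC of pooled out-of-fold predictions over repeated k-fold CV."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    aucs = []
    for _ in range(repeats):
        rng.shuffle(idx)
        oof = [0.0] * len(labels)
        for fold in (idx[i::k] for i in range(k)):
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            predict = fit_rates([groups[i] for i in train],
                                [labels[i] for i in train])
            for i in fold:
                oof[i] = predict(groups[i])
        aucs.append(auc(oof, labels))
    return sum(aucs) / len(aucs)


# Hypothetical data: three categories with 2/20, 5/20 and 8/20 events.
groups = ["A"] * 20 + ["B"] * 20 + ["C"] * 20
labels = ([1] * 2 + [0] * 18) + ([1] * 5 + [0] * 15) + ([1] * 8 + [0] * 12)

apparent_model = fit_rates(groups, labels)
apparent_auc = auc([apparent_model(g) for g in groups], labels)
cv_auc = repeated_cv_auc(groups, labels, k=10, repeats=100, seed=0)
print(round(apparent_auc, 3), round(cv_auc, 3))
```

Comparing the two printed values shows how far the estimate obtained on the development data departs from the cross-validated one for a given dataset.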

As a second approach, a sensitivity analysis was performed. Multiple imputation by chained equations (MICE) was used to impute the missing data [32, 33]. Two models using two thresholds (i.e., 0.5 for model 3 and 0.4 for model 4) were used for predictor selection. Details of the imputation analysis and predictor selection for models 3 and 4 are provided in the supplementary material. Model performance was then evaluated as described for the main analysis.
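One plausible reading of the threshold-based predictor selection is sketched below: a variable is retained if it was selected in at least a given fraction of the imputed datasets. The exact rule is given only in the supplementary material, so the use of "at least" rather than "more than", the variable names, and the selection counts are all assumptions made for illustration.

```python
def select_by_frequency(selection_flags, threshold):
    """Retain variables selected in at least `threshold` of the imputed datasets.

    selection_flags maps a variable name to a list of 0/1 flags,
    one flag per imputed dataset (1 = variable selected in that dataset).
    """
    frequency = {name: sum(flags) / len(flags)
                 for name, flags in selection_flags.items()}
    chosen = [name for name, f in frequency.items() if f >= threshold]
    return chosen, frequency


def flags(n_selected, n_datasets=50):
    """Helper building a hypothetical flag list with n_selected ones."""
    return [1] * n_selected + [0] * (n_datasets - n_selected)


# Hypothetical selection counts over 50 imputed datasets.
selection_flags = {
    "var_a": flags(50),   # frequency 1.00
    "var_b": flags(30),   # frequency 0.60
    "var_c": flags(22),   # frequency 0.44
    "var_d": flags(10),   # frequency 0.20
}

conservative, frequency = select_by_frequency(selection_flags, threshold=0.5)
liberal, _ = select_by_frequency(selection_flags, threshold=0.4)
print(conservative, liberal)
```

Lowering the threshold can only add variables, which is consistent with the 0.4 threshold yielding the larger of the two models.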

All analyses were performed using STATA (version 15, StataCorp, College Station, TX, USA) and R Statistical Software (Foundation for Statistical Computing, Vienna, Austria).

Results

Baseline characteristics of all patients included in the study are presented in Table 1. Missing values were observed only for tumor dimension (n = 13) and information on Ki67 expression (n = 2). The mean age of our study population was 50.7 years, and 48% of included patients were postmenopausal women. Thirty-nine percent of patients had HER2+ breast cancer, 32% had triple-negative neoplasms, and the remaining 29% had luminal disease. The mean tumor diameter was 2.95 cm.

Table 1 Baseline characteristics of the study population

Nineteen out of seventy-nine (24%) patients had pCR to treatment, while the remaining 60/79 (76%) had residual disease on surgical specimens.

During follow-up, disease relapse occurred in 16 cases (local recurrence n = 9, distant metastases n = 7), while two patients died.

In the patient with a multifocal tumor, two areas of focal uptake on PET images were segmented separately and considered as two different lesions. Results for the extracted features are provided in supplementary Table 3. Overall, all textural features (n = 43) were available in 71/79 cases. In 8/79 patients, lesions were smaller than 64 voxels; hence, only first-order parameters were extracted. Figure 1 provides images of two patients as representative examples.

Fig. 1

Representative PET/CT images of a patient with pCR (a) and a patient with pathological residual disease (b) after NAC. (a) A three-dimensional reconstruction of PET/CT images, axial PET (zoom on the primary breast lesion), fused axial PET/CT, and a transverse view of a three-dimensional reconstruction of PET/CT images (panels a, b, c, and d, respectively) in an 81-year-old patient with HER2+ breast cancer, estrogen receptor expression 0%, progesterone receptor expression 0%, Ki67 40%, and HER2/neu +++, with no evidence of residual disease at postsurgical histological analysis after neoadjuvant treatment (12 cycles of paclitaxel/trastuzumab). PET parameters (↑ and ↓: above and below the median value): SUVmaximum 14.3↑, total lesion glycolysis (TLG) 33.5↓, CorrelationGLCM 0.52↑, CoarsenessNGLDM 0.035↑, GLNUGLZLM 4.056↓. (b) A three-dimensional reconstruction of PET/CT images, axial PET (zoom on the primary breast lesion), fused axial PET/CT, and a transverse view of a three-dimensional reconstruction of PET/CT images (panels a, b, c, and d, respectively) in a 37-year-old patient with HER2+ breast cancer, estrogen receptor expression 95%, progesterone receptor expression 90%, Ki67 50%, and HER2/neu +++, with persistence of disease at postsurgical histological analysis after neoadjuvant treatment (four cycles of doxorubicin/cyclophosphamide followed by four cycles of docetaxel/trastuzumab). PET parameters: SUVmaximum 19.2↑, TLG 17.8↓, CorrelationGLCM 0.295↓, CoarsenessNGLDM 0.041↑, GLNUGLZLM 2.139↓

Model selection and model performance

Model 1

Results from the univariable and multivariable models used to select variables for the simple prediction model (clinical data, biological characteristics of the tumor, and first-order imaging features) are presented in Table 2. Age, molecular subtype, and HER2 status satisfied the inclusion threshold at univariable analysis. Since the molecular subtype is defined by the expression of selected receptors, including HER2, HER2 status was removed from the model because of possible collinearity. SUVmax and TLG did not satisfy the inclusion threshold in our dataset (p > 0.2). Nevertheless, we evaluated whether model performance would improve when SUVmax and TLG were added (Table 2). When discrimination was evaluated on the same dataset used for model development, the addition of SUVmax and TLG slightly improved discrimination (AUC = 0.73 vs 0.71). However, the mean AUC over 100 iterations of tenfold cross-validation was lower for the model with SUVmax and TLG (mean AUC = 0.59) than for the model containing only age and molecular subtype (mean AUC = 0.61) (Fig. 2). Thus, the model with age and molecular subtype was selected as the final model 1. Calibration for model 1 is presented in Fig. 3.

Table 2 Odds ratios and 95% CIs from the univariable and multivariable logistic regression models predicting pCR to NAC using the variables selected in model 1
Fig. 2

Variation in the area under the ROC curve (AUC) estimates with the 95% confidence intervals over the 100 iterations of tenfold cross-validation for (a) the model including only age and the molecular subtype and (b) the model including SUVmax and TLG in addition to age and molecular subtype

Fig. 3

Calibration of the logistic regression model including only age and molecular subtype (model 1). The size of the circles is proportional to the number of events in the deciles of predicted risk
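The grouping that underlies a calibration plot of this kind can be sketched as follows: patients are sorted by predicted risk and split into equal-count bins (deciles when ten bins are used), and each bin contributes one point with the mean predicted risk, the observed event rate, and the number of events. The predictions and outcomes below are invented, and five bins are used only to keep the toy example small.

```python
def calibration_by_quantile(pred, obs, n_bins=10):
    """Return (mean predicted risk, observed event rate, events) per risk bin."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    rows = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        members = order[start:stop]
        if not members:              # can happen when n_bins > number of patients
            continue
        events = sum(obs[i] for i in members)
        mean_pred = sum(pred[i] for i in members) / len(members)
        rows.append((mean_pred, events / len(members), events))
    return rows


# Hypothetical predicted risks and observed outcomes for 20 patients.
pred = [(i + 1) / 20 for i in range(20)]
obs = [0, 0, 0, 0,  0, 0, 1, 0,  0, 1, 0, 1,  1, 0, 1, 1,  1, 1, 1, 1]

rows = calibration_by_quantile(pred, obs, n_bins=5)
for mean_pred, observed_rate, events in rows:
    print(round(mean_pred, 3), observed_rate, events)
```

Points close to the diagonal (observed rate equal to mean predicted risk) indicate good calibration; the event count per bin is what would set the circle size.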

Model 2

One second-order imaging feature (correlation) and two higher-order imaging features (coarsenessNGLDM and GLNUGLZLM) satisfied the inclusion threshold at univariable analysis. Results from the univariable and multivariable logistic regression using the variables selected in model 2 are presented in Table 3. The actual AUC for model 2 was 0.72 (Table 3). Calibration for model 2 is presented in Fig. 4.

Table 3 Odds ratios and 95% CIs from the univariable and multivariable logistic regression models predicting pCR to NAC using the variables selected in model 2
Fig. 4

Calibration of the logistic regression model including age, molecular subtype, and second- and higher-order imaging features (model 2). The size of the circles is proportional to the number of events in the deciles of predicted risk

Models 3 and 4

Using the threshold of 0.5 (conservative criterion, model 3), age, molecular subtype, HER2 status, correlation, and coarsenessNGLDM were selected for the final model. Age, molecular subtype, estrogen receptors, HER2 status, type of NAC, Ki67, correlation, coarsenessNGLDM, and GLNUGLZLM were selected when 0.4 was used as the threshold (model 4). HER2 status was excluded from both models because of possible collinearity with molecular subtype. The actual AUCs for the two models were 0.70 and 0.73, respectively (Table 4). Calibration of the two models is shown in Fig. 5.

Table 4 Odds ratios and 95% CIs from the multivariable logistic regression models predicting pCR to NAC using the variables selected over 50 imputed datasets using thresholds set at 0.5 (model 3) and 0.4 (model 4)
Fig. 5

Calibration of the logistic regression models including the variables selected over 50 imputed datasets using thresholds set at (a) 0.5 and (b) 0.4

Discussion

The primary aim of the study was to investigate the relationship between pCR to NAC in breast cancer patients and PET-derived features. In particular, we aimed to assess the predictive power of first-, second-, and higher-order imaging features. The strength of the present study lies in the methodology used for the analysis, which was dictated by the sample size, the retrospective study design, and the number of events (i.e., pCR versus non-pCR). Accordingly, a rigorous statistical plan comparing the complete-case approach (i.e., complete-case dataset) with the sensitivity analysis (i.e., dataset in which missing data were imputed) was applied for model building, handling of missing data, and variable selection, in order to obtain reliable results not affected by chance. Two models for prediction of pCR to NAC were built within the complete-case approach: a simple model containing clinical data, biological characteristics of the tumor, and first-order imaging features (model 1), and a more complex model in which second- and higher-order features were added (model 2). Age and molecular subtype proved to be the sole, albeit weak, predictors of outcome in the simple model, with a slight improvement in the model’s predictive power upon addition of the textural PET-derived features.

18F-FDG PET imaging parameters have already been proposed as predictors of response to NAC in patients with other tumor types, for example esophageal cancer [34], but information on their predictive role in breast cancer is scarce. Only standard PET parameters (e.g., SUVmax, TLG) have been correlated to response to NAC in breast cancer patients [28, 29]. In our study, however, SUVmax and TLG did not appear to be predictors of pCR to NAC, and adding them to a simple model containing only age and molecular subtype did not lead to an improvement in the model’s predictive power. Very recently, Groheux et al. [35] showed change in 18F-FDG uptake (ΔSUVmax) between PET1 and PET2 (performed before and after two cycles of NAC, respectively) in triple-negative breast cancer to be highly associated with pCR (p = 0.0001), which was not the case for the absolute value of baseline SUVmax (p = 0.11). The AUC of pCR prediction improved (from 0.63 to 0.76) when baseline SUVmax was combined with the genomic grade index (p = 0.016).

In our study, model 1, which included clinical data only (i.e., age and molecular subtype), and model 2, which also incorporated second- and higher-order imaging features (i.e., correlation, coarsenessNGLDM, and GLNUGLZLM), had a similar discrimination performance. Thus, in our hands the inclusion of second- and higher-order imaging features did not improve the discriminatory power of the model. This is somewhat inconsistent with recent evidence of a potential role of high-order PET imaging features as predictors of complete response to NAC in breast cancer [22, 23]. Molina-Garcia et al. [22] found an association between higher-order imaging features, molecular phenotypes, and NAC response, supporting a relationship between local tumor aggressiveness and tumor complexity. Similarly, Yoon et al. [23] found that PET-derived textural features were able to predict responders to NAC and survival in locally advanced breast cancer. A possible explanation for the discrepancy between these findings and our own lies in the study population and in the methods used for the analysis. We evaluated a similar number of cases (79 versus 68 [22] and 83 [23], respectively), but unlike Molina-Garcia et al. [22] we excluded patients with distant metastases (chemotherapy is not a neoadjuvant treatment in the presence of distant metastases [3]), and unlike Yoon et al. [23] we considered only patients presenting pCR as responders since, as mentioned above, this is the only class with a major clinical prognostic relevance. Additionally, in our series, second- and higher-order imaging features were not available for 8/79 patients.

We could not rule out model misspecification, particularly with regard to variable selection. Therefore, in the second approach (based on imputed datasets with complete information, i.e., sensitivity analysis), an alternative specification strategy, lasso regression, was used that favored variables with a large effect on the model’s discrimination [36]. The lasso approach also addressed an additional issue of our dataset (i.e., the combination of a limited number of observations and a large number of potential predictors). Two models (models 3 and 4) were generated using distinct thresholds for predictor selection based on the frequency of variable selection in the imputed datasets. In addition to patient characteristics (i.e., age in both models), tumor markers (i.e., molecular subtype in both models, and estrogen receptor and Ki67 in model 4 only), and type of NAC (model 4 only), models 3 and 4 had second- and higher-order PET imaging features as predictors (correlation and coarsenessNGLDM in both, GLNUGLZLM in model 4 only). These models had actual AUC values of 0.70 and 0.73, respectively, which were close to those of models 1 (0.71) and 2 (0.72). This is not surprising, given that all models shared the variables with the largest predictive weight (i.e., age and tumor molecular subtype), regardless of the strategy used for variable selection. Despite the poor discrimination, this might indicate that the models were appropriately specified.

Overall, all models displayed limited discrimination, probably due to the small sample size, which was further reduced by missing data and the complete-case approach used in the main analysis. However, we performed a sensitivity analysis using multiple imputation by chained equations, which led to very similar results. The combination of a large number of predictors with the small sample size could have led to the sparse cell problem if more typical, although highly criticized, methods for model selection had been used, such as backward or stepwise selection. Indeed, in our dataset, we were not able to use these methods for model selection because of convergence issues. However, univariable models for predictor identification performed at least as well as the more complex models used for dimension reduction, such as lasso regression. Furthermore, because of the small sample size we could not divide our population into training and testing sets, which is recommended for model development and validation. Moreover, alternative validation methods (e.g., leave-pair-out cross-validation) might improve the AUC estimates [37].

In our series, the only characteristic that had some predictive weight was the tumor molecular subtype. This is probably due to the small sample size and the limited number of events (i.e., pCR), which represent limitations both for model building and for cross-validation. It should be noted that the models predict that patients with subtype 2 (HER2+) are more likely to have a pCR to NAC than those with subtype 1 (luminal) (OR = 6.59, CI = 1.22–35.47 in the first model), which is consistent with previous knowledge [38].
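The width of the reported interval can be put in perspective with a short back-calculation. The sketch assumes that the interval is a Wald interval, symmetric on the log-odds scale, which the paper does not state; under that assumption the standard error, z statistic, and p-value follow directly from the reported OR and CI.

```python
import math


def wald_from_ci(or_hat, lower, upper, z_crit=1.959964):
    """Recover SE(log OR), z and two-sided p from an OR and its 95% Wald CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(or_hat) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # For a Wald interval the OR is the geometric mean of the two limits.
    geometric_midpoint = math.sqrt(lower * upper)
    return se, z, p_value, geometric_midpoint


se, z, p_value, midpoint = wald_from_ci(6.59, 1.22, 35.47)
print(round(se, 3), round(z, 2), round(p_value, 3), round(midpoint, 2))
```

The geometric midpoint of the limits is close to the reported OR, which is consistent with the Wald assumption, and the large standard error on the log scale reflects the small number of events on which the estimate rests.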

To summarize, our study has some limitations. First, not all clinical data were available for all patients because of the retrospective study design. Also, as mentioned above, a larger sample size would improve the results and allow a potentially clinically relevant sub-analysis according to molecular subtype. Moreover, second- and higher-order features could not be extracted for all patients for technical reasons. However, patient selection and the methods used for data analysis were extremely rigorous. Only patients with locally advanced breast cancer eligible for NAC were included in the analysis. Accordingly, patients with a previous history of breast cancer or metastases (either suspected or confirmed) were excluded. pCR was used as the primary endpoint since it is the only parameter with clinical prognostic significance. Imputation analysis was performed for all missing data.

In conclusion, our findings suggest that PET-derived radiomic features may be associated with pCR to NAC. Therefore, advanced image analysis should be further evaluated in a large population to better clarify and quantify the potential predictive role of textural features in locally advanced breast cancer patients.

Clinical relevance

Breast cancer patients with pCR to NAC have a better prognosis than partial and non-responders. Predictors of pCR would be of key relevance in daily clinical practice in order to better guide treatment decisions. In this regard, the extraction of biologically relevant information from staging images is increasingly acknowledged as a non-invasive prognostic approach, but its effectiveness has not yet been established. Although image-derived biomarkers require clinical validation before entering routine clinical practice, our data suggest that they are a promising prognostic tool in locally advanced breast cancer.