Introduction

Phyllodes tumor is a rare neoplasm composed of epithelial and stromal components, with a reported incidence of less than 1.0% of all female breast tumors [1,2,3]. Whereas all fibroadenomas are considered benign, phyllodes tumors are subclassified histologically as benign, borderline, or malignant [4]. Sonography cannot reliably differentiate phyllodes tumors from fibroadenomas [5], and core needle biopsy often fails to distinguish them [6]. Nevertheless, a preoperative diagnosis is crucial because the two tumors require different surgical procedures. Fibroadenomas require only enucleation, whereas phyllodes tumors of all grades require complete surgical excision with wide margins of at least 1 cm owing to their high recurrence rate [7, 8].

Phyllodes tumors and fibroadenomas cannot be precisely differentiated on breast MRI [9], but the presence of a cystic component, strong lobulation, and heterogeneity on contrast-enhanced T1-weighted images may be helpful for differentiation [10].

Radiomics analysis is an emerging translational field of research that aims to find associations between quantitative information extracted from clinical images and other clinical data. It refers to various mathematical methods that allow the evaluation of the gray-level intensity and position of the pixels on medical images [11]. It was developed to detect subtle changes that may not be visible to the radiologist’s eye and is believed to reflect tissue microstructure organization [12]. The heterogeneity of a tumor is a feature characterized by areas of hemorrhage, cystic changes, high cell density, necrosis, and myxoid changes [13, 14]. Radiomics analysis of breast MRI has been applied to differentiate benign from malignant lesions [15, 16] and to predict sentinel lymph node metastasis preoperatively [17], treatment response to neoadjuvant chemotherapy [18], recurrence-free survival [19], and breast cancer subtypes [20, 21].

Few studies have investigated the value of quantitative MRI texture features for differentiating phyllodes tumors from fibroadenomas [22, 23]. Recently, different machine learning classifiers have been compared to determine the optimal method [24]. Therefore, in this study, we investigated three machine learning classifiers to assess the capability of the radiomics model.

The aim of this study was to develop an MRI-based radiomics machine learning model for differentiating phyllodes tumors from fibroadenomas.

Materials and methods

This retrospective study was approved by our institutional ethics committee, which waived the need for informed consent.

Patients

The enrolled patients had histologically confirmed phyllodes tumor or fibroadenoma. Patients were consecutively identified by searching the pathology database of our institution for the period from January 2004 to February 2020. Eighty-eight patients with a diagnosis of phyllodes tumor (n = 32; mean age, 47.6 ± 10.8 years) or fibroadenoma (n = 56; mean age, 40.2 ± 12.5 years) were enrolled. The inclusion criteria were as follows: (i) pathologically confirmed phyllodes tumor or fibroadenoma after surgical operation or core needle biopsy and (ii) breast dynamic contrast-enhanced (DCE) MRI performed prior to any surgical operation. The exclusion criteria were as follows: (i) a maximum tumor diameter of less than 10.0 mm, (ii) MRI performed outside our institution, and (iii) poor image quality, such as significant motion artifact. The exclusion threshold of 10.0 mm for maximum tumor diameter was set to minimize the influence of partial volume effects, which might distort the true tissue-specific image texture [25].

Data augmentation

Data augmentation has proven to be a powerful method for avoiding overfitting when only a small amount of data is available [26,27,28,29]. Because the small number of patients might lead to overfitting, we augmented the labeled data in our study by obtaining samples from different levels of the tumors [30]. In addition, to account for class imbalance and its potential consequences for the machine learning schemes [31], more samples were segmented from the phyllodes tumors [30]. On average, four to five levels of data were obtained from the phyllodes tumors, and two to three levels were obtained from the fibroadenomas. The augmentation resulted in 300 labeled segmentations (150 from phyllodes tumors and 150 from fibroadenomas) from 88 breast tumors (32 phyllodes tumors and 56 fibroadenomas). These 300 labeled segmentations were divided into a training set of 200 (100 phyllodes tumor and 100 fibroadenoma) and a validation set of 100 (50 phyllodes tumor and 50 fibroadenoma).
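
As a quick arithmetic check on the augmentation described above, the average number of levels sampled per tumor can be derived from the reported totals. The sketch below uses only the counts stated in the text; the variable names are illustrative and not taken from the study.

```python
# Counts reported in the text: tumors per class and labeled samples per class.
tumors = {"phyllodes": 32, "fibroadenoma": 56}
samples = {"phyllodes": 150, "fibroadenoma": 150}

# Average number of levels sampled per tumor in each class.
levels_per_tumor = {name: samples[name] / tumors[name] for name in tumors}

# Class ratio (phyllodes : fibroadenoma) before and after augmentation.
ratio_before = tumors["phyllodes"] / tumors["fibroadenoma"]
ratio_after = samples["phyllodes"] / samples["fibroadenoma"]

for name, value in levels_per_tumor.items():
    print(f"{name}: {value:.2f} levels per tumor")
print(f"class ratio before: {ratio_before:.2f}, after: {ratio_after:.2f}")
```

The resulting averages (between four and five for phyllodes tumors, between two and three for fibroadenomas) agree with the ranges given in the text.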

The workflow of this study is shown in Fig. 1.

Fig. 1 Workflow of this study

MRI data acquisition

All examinations were performed with the patient in a prone position. The MR images were acquired using three MRI scanners: 60 patients with a 1.5-T system (Signa 1.5 T, GE Healthcare) with a dedicated eight-channel breast phased-array coil, 20 patients with a 3.0-T system (Discovery MR750, GE Healthcare) with a dedicated eight-channel breast phased-array coil, and 8 patients with a 3.0-T system (SIGNA Pioneer, GE Healthcare) with a dedicated eight-channel breast phased-array coil.

DCE-MR images were acquired using a 3D fat-suppressed T1-weighted volume imaging breast assessment (VIBRANT) sequence composed of one pre-contrast and three post-contrast phases labeled as pre-contrast enhanced, post-CE1, post-CE2, and post-CE3. The gadolinium-based contrast agent gadobutrol (Gadavist, Bayer Healthcare Pharmaceuticals) was administered at a dose of 0.1 mmol per kg body weight and a rate of 2 mL/s, followed by a 20-mL saline flush at the same rate. The acquisition conditions were as follows: VIBRANT, repetition time (TR) = 7.9 ms, echo time (TE) = 4.3 ms, flip angle = 12°, field of view (FOV) = 34 cm, acquired matrix = 300 × 300, in-plane spatial resolution = 1.1 × 1.1 mm, thickness = 1.1 mm, temporal resolution ≈ 120 s, axial orientation.

In addition, axial two-dimensional fat-saturated T2-weighted fast spin-echo or iterative decomposition of water and fat with echo asymmetry and least-squares estimation (IDEAL) images were acquired under the following conditions: TR = 6680 ms, TE = 68 ms, slice thickness = 3.0 mm, matrix = 320 × 192, FOV = 36 cm.

Reference standard, region of interest segmentation, and radiomics feature extraction

Radiomic feature extraction was performed using the MaZda software (version 4.6, Technical University of Lodz) [32]. For each sequence, six feature categories (histogram, co-occurrence matrix, run length matrix, absolute gradient, autoregressive model, and wavelet transform) were extracted using MaZda, as shown in Table 1. The details of the reference standard, region of interest (ROI) segmentation, radiomics feature extraction, and reproducibility analysis are described in Supplemental Material.

Table 1 Radiomics features extracted by MaZda

In this study, we developed a classification scheme using machine learning classifiers. Figure 2 shows the flowchart of the proposed scheme.

Fig. 2 Flowchart of the radiomics analysis

Visual assessment by radiologists

Two radiologists (radiologist 1 with 31 years and radiologist 2 with 16 years of experience in breast MRI) independently reviewed the MRI features. The interpretation of MRI features was based on the American College of Radiology Breast Imaging Reporting and Data System MR imaging criteria (version 5) [22, 33, 34]. The details of the image analysis are described in Supplemental Material.

Figure 3 shows representative cases of phyllodes tumor and fibroadenoma on T2WI, pre-T1WI, the first phase of DCE-MRI, and the late phase of DCE-MRI.

Fig. 3 Magnetic resonance images of phyllodes tumor and fibroadenoma. A 37-year-old woman with a pathologically confirmed borderline malignant phyllodes tumor: (a) axial T2WI, (b) pre-T1WI, (c) first phase of DCE-MRI, and (d) late phase of DCE-MRI. A 25-year-old woman with a pathologically confirmed fibroadenoma: (e) axial T2WI, (f) pre-T1WI, (g) first phase of DCE-MRI, and (h) late phase of DCE-MRI. DCE, dynamic contrast-enhanced; MRI, magnetic resonance imaging; T1WI, T1-weighted imaging; T2WI, T2-weighted imaging

ComBat harmonization

Radiomics feature values are affected by the different magnetic fields, protocols, and technical settings of the MR scanners. In this study, all radiomics features extracted from images acquired from different MR scanners were harmonized to remove the scanner effect using a ComBat harmonization procedure [35, 36].
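
The idea behind removing a scanner effect can be illustrated with a much simpler location-scale adjustment. The sketch below is not the ComBat procedure itself: it omits the empirical Bayes shrinkage and the covariate preservation that ComBat provides, and it is not the implementation used in this study. The function name and the example values are hypothetical.

```python
from statistics import mean, pstdev

def harmonize_feature(values, scanners):
    """Align one feature's per-scanner mean and spread to the pooled values.

    Simplified location-scale adjustment for illustration only; NOT full
    ComBat (no empirical Bayes shrinkage, no covariate preservation).
    """
    pooled_mean = mean(values)
    pooled_sd = pstdev(values)
    harmonized = list(values)
    for scanner in set(scanners):
        idx = [i for i, s in enumerate(scanners) if s == scanner]
        batch = [values[i] for i in idx]
        batch_mean = mean(batch)
        batch_sd = pstdev(batch) or 1.0  # guard against zero spread
        for i in idx:
            z = (values[i] - batch_mean) / batch_sd
            harmonized[i] = z * pooled_sd + pooled_mean
    return harmonized

# Hypothetical values of one feature, with an offset between two scanners.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
scanners = ["1.5T", "1.5T", "1.5T", "3.0T", "3.0T", "3.0T"]
harmonized = harmonize_feature(values, scanners)
print([round(v, 2) for v in harmonized])
```

After the adjustment, the two scanner groups share the same mean, so a classifier can no longer separate samples by scanner offset alone.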

The least absolute shrinkage and selection operator logistic regression

Feature selection was required to reduce overfitting, redundancy, or any other type of bias in our radiomics analysis. The least absolute shrinkage and selection operator (LASSO) algorithm was used in the training set for dimensionality reduction and feature selection; it performs variable selection and regularization to enhance the prediction accuracy and interpretability of the resulting statistical model [37, 38]. The one-standard-error-of-the-minimum criterion (the 1-SE criterion) was used with tenfold cross-validation to tune the regularization parameter (λ) and select features. The LASSO analysis was performed using the “glmnet” package in the R software (R Foundation for Statistical Computing) [39]. The features with non-zero coefficients were selected from the candidate features and formed a radiomic signature for machine-learning classification analysis.
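
The 1-SE criterion can be sketched as follows: among all candidate values of λ, choose the largest one whose cross-validated error lies within one standard error of the minimum error. This mirrors the definition of the lambda.min and lambda.1se values reported by cv.glmnet, but the helper below is an illustrative re-implementation and the error values are hypothetical, not results from this study.

```python
def lambda_one_se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambdas : candidate regularization strengths
    cv_mean : mean cross-validated error for each lambda
    cv_se   : standard error of that mean for each lambda
    """
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    # Largest lambda (sparsest model) whose error is still within 1 SE.
    eligible = [lambdas[i] for i in range(len(lambdas))
                if cv_mean[i] <= threshold]
    return lambdas[best], max(eligible)

# Hypothetical cross-validated deviance on a small lambda grid.
lambdas = [0.001, 0.005, 0.01, 0.05, 0.1]
cv_mean = [0.62, 0.55, 0.57, 0.70, 0.95]
cv_se = [0.04, 0.03, 0.03, 0.05, 0.06]

lam_min, lam_1se = lambda_one_se(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Because a larger λ shrinks more coefficients to zero, the 1-SE choice yields a sparser feature set than the minimum-error choice at a statistically comparable error.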

Machine learning classifier

To achieve high and robust classification performance, three machine learning classifiers were implemented: support vector machine (SVM) [40], extreme gradient boosting (XGB) [41], and random forest (RF) [42]. Model validation was performed with tenfold cross-validation. The classification algorithms were implemented using the “caret” package in R software. The performance of the classifiers was evaluated on the basis of the AUC value.
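
Tenfold cross-validation partitions the training samples into ten folds, each of which serves once as the held-out set. The sketch below builds class-balanced (stratified) folds for a training set of the size reported above. Whether the folds in this study were stratified is not stated, so the stratification, the function name, and the seed are assumptions made for illustration.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, balancing classes."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Training set as described: 100 phyllodes (1) and 100 fibroadenoma (0).
labels = [1] * 100 + [0] * 100
folds = stratified_folds(labels, k=10)

for held_out in folds:
    train_idx = [i for f in folds if f is not held_out for i in f]
    # A classifier would be fitted on train_idx and scored on held_out here.
    assert len(train_idx) + len(held_out) == len(labels)
```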

The workflow for imaging data processing and radiomics analysis is presented in Fig. 2.

Statistical analysis

The statistical analysis was conducted with R software (R Foundation for Statistical Computing) [39]. The three machine learning algorithms were applied with the R packages “e1071,” “kernlab,” “xgboost,” and “randomForest.” The performance of the classifiers was evaluated using the area under the curve (AUC). The AUC, accuracy, sensitivity, and specificity of each classifier were calculated with the package “MLeval.” The DeLong test was performed with the package “pROC.” A p value less than 0.05 indicated a statistically significant difference. Group differences were assessed using the Mann–Whitney U test for continuous variables. Univariate and bivariate analyses were performed with SPSS software (version 26; IBM Corp.).
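
The AUC has a direct rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which is the Mann–Whitney U statistic divided by the number of positive-negative pairs. The following sketch computes it by pairwise comparison. The scores are hypothetical and the function is an illustration, not the routine used by the R packages named above.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores (1 = phyllodes tumor, 0 = fibroadenoma).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
auc = auc_from_scores(scores, labels)
print(auc)  # 14 of 16 pairs are ordered correctly -> 0.875
```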

Results

Patient demographics and visual assessment of conventional MRI

The details of the patients are shown in Table 2. All of the patients were female. A significant difference was found between the phyllodes tumor group and the fibroadenoma group regarding age, size, internal septation, strong lobulation, and cyst (p < 0.05). No significant difference was found between the two groups in terms of the heterogeneity of T2WI, the heterogeneity of the late-phase DCE-MRI, and TIC pattern. Older age, larger size, the presence of internal septation, the presence of strong lobulation, and the presence of cyst demonstrated the strongest statistical association with phyllodes tumors.

Table 2 Demographic characteristics of patients and visual assessment of conventional magnetic resonance imaging

Intra- and interobserver reproducibility of radiomics features

A total of 1100 radiomic features for each slice were extracted from T2WI, pre-T1WI, the first phase of DCE-MRI, and the late phase of DCE-MRI. Among the 1100 extracted features, 30 features with an intraclass correlation coefficient (ICC) value less than 0.81 for intra- and inter-reader reproducibility were excluded. The remaining 1070 features, with an ICC equal to or higher than 0.81, were included in the subsequent feature selection process.
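
The reproducibility filter can be sketched as computing an ICC per feature from repeated measurements and keeping only features at or above the 0.81 threshold. The exact ICC form used in this study is described in the Supplemental Material and is not reproduced here; the sketch assumes the two-way random-effects, absolute-agreement, single-measure form, ICC(2,1), and uses hypothetical feature names and values.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table of n subjects (rows) by k raters (columns)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)              # between-subject mean square
    ms_cols = ss_cols / (k - 1)              # between-rater mean square
    ms_err = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical values of two features measured by two readers on five lesions.
features = {
    "feature_a": [[10.0, 10.5], [12.0, 11.8], [15.0, 15.4], [9.0, 9.1], [20.0, 19.5]],
    "feature_b": [[10.0, 14.0], [12.0, 9.0], [15.0, 11.0], [9.0, 13.0], [20.0, 12.0]],
}

ICC_THRESHOLD = 0.81  # threshold stated in the text
kept = [name for name, table in features.items()
        if icc_2_1(table) >= ICC_THRESHOLD]
print(kept)
```

Here the two readers agree closely on feature_a but not on feature_b, so only feature_a passes the filter.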

LASSO logistic regression

To identify the relevant predictors, all explanatory features extracted from MR images of the training set were included in the LASSO logistic regression. Features with regression coefficients of zero were eliminated (Fig. 4a and b). Finally, 35 features (8 histogram features, 8 co-occurrence matrix features, 5 run-length matrix features, 6 autoregressive model features, and 8 wavelet transform features) out of 1070 radiomics features were selected to build the radiomics model (Table 3). The heat map of the selected features is presented in Fig. 5 and shows the distribution differences of normalized radiomics feature values.

Fig. 4 Radiomics feature selection using the least absolute shrinkage and selection operator (LASSO) regression model. (a) Tuning parameter (λ) selection in the LASSO model used tenfold cross-validation via the minimum criterion. The optimal values of the LASSO tuning parameter (λ) are indicated by dotted vertical lines, and a value λ of 0.0078 was chosen. (b) LASSO coefficient profiles of 1,070 radiomics features. A coefficient profile plot was generated versus the selected log λ values using tenfold cross-validation. Thirty-five radiomics features with nonzero coefficients were selected

Table 3 Radiomics features selected by the least absolute shrinkage and selection operator regression

Fig. 5 Heat map of the selected features after least absolute shrinkage and selection operator (LASSO) regression. The x axis refers to radiomic features, and y axis refers to different subjects. Dendrograms regarding radiomics and subjects were displayed to facilitate the visualization of the radiomic patterns. The type of tumor for each subject was indicated by different colors (phyllodes tumor as blue/fibroadenoma as light blue)

Diagnostic performance of the radiomics model with machine learning classifiers

The diagnostic performance of the radiomics model for differentiating phyllodes tumor from fibroadenoma was evaluated using receiver operating characteristic (ROC) curves of the training and validation sets. Of the three machine learning classifiers, SVM yielded the highest AUC of 0.99 (95% confidence interval [CI]: 0.98–1.00), with a sensitivity of 97.0% (95% CI: 92.0–99.0), specificity of 98.0% (95% CI: 93.0–99.0), and accuracy of 97.5% (95% CI: 96.0–98.8) in the training set. XGB yielded an AUC of 0.98 (95% CI: 0.96–1.00), and RF yielded an AUC of 0.98 (95% CI: 0.96–1.00) in the training set.

The radiomics model with SVM yielded an AUC of 0.96 (95% CI: 0.92–1.00), with a sensitivity of 90.0% (95% CI: 79.0–96.0), specificity of 92.0% (95% CI: 81.0–97.0), and accuracy of 91.0% (95% CI: 87.9–93.7) in the validation set. XGB yielded an AUC of 0.93 (95% CI: 0.88–0.98), and RF yielded an AUC of 0.91 (95% CI: 0.85–0.97) in the validation set.
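
The relationship among the reported sensitivity, specificity, and accuracy can be checked from confusion-matrix counts. The counts below are not reported in the text; they are inferred from the stated percentages and the class sizes of each set (100 per class for training, 50 per class for validation) and should be read as an assumption.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Inferred counts (assumption): positives are phyllodes tumor samples.
train = diagnostic_metrics(tp=97, fn=3, tn=98, fp=2)       # 100 per class
validation = diagnostic_metrics(tp=45, fn=5, tn=46, fp=4)  # 50 per class

for name, (sens, spec, acc) in [("training", train), ("validation", validation)]:
    print(f"{name}: sensitivity {sens:.1%}, specificity {spec:.1%}, accuracy {acc:.1%}")
```

With balanced classes, accuracy is the simple average of sensitivity and specificity, which is consistent with the reported 97.5% and 91.0%.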

The radiological model

To evaluate the diagnostic performance of the visual assessment, a radiological model was built. For the radiological model, the SVM classifier yielded an AUC of 0.77 (95% CI: 0.66–0.88). The sensitivity, specificity, and accuracy of the radiological model were 65.6%, 85.7%, and 78.0%, respectively.

The combined radiomics and visual assessment model

For the combined radiomics and radiological model, the SVM classifier yielded an AUC of 0.97 (95% CI: 0.94–1.00), with a sensitivity of 92.0%, specificity of 94.0%, and accuracy of 93.0%.

The diagnostic performance of the radiomics model, the radiological model, and the combined model are shown in Table 4. ROC curves of the radiomics model, radiological model, and the combined model are shown in Fig. 6.

Table 4 Diagnostic performance of the radiomics model, the radiological model, and the combined model

Fig. 6 Receiver operating characteristic (ROC) curves of the radiomics, radiological, and combined models

The combined model and the radiomics model each had a significantly higher AUC than that of the radiological model (p < 0.001 and p < 0.001, respectively). The combined model had a slightly higher AUC than that of the radiomics model in the validation set, but the difference was not significant (p = 0.391).

Discussion

In the current study, we developed and validated an MRI-based radiomics model for differentiating phyllodes tumor from fibroadenoma. The proposed radiomics model with SVM showed good diagnostic performance for differentiating phyllodes tumor from fibroadenoma, with AUC values of 0.99 and 0.96 when applied to the training and validation sets, respectively. The radiomics model significantly improved diagnostic performance compared to the radiological model (AUC of 0.96 versus 0.77). The combined model achieved the highest performance with an AUC value of 0.97. In comparison with the radiological model, both our combined model and our radiomics model yielded higher performance.

Mai et al. were the first to investigate machine learning-based MRI texture analysis for differentiating phyllodes tumor from fibroadenoma [22]. They reported that a combination of clinical and conventional MRI features with texture features from T2-weighted short tau inversion recovery sequences yielded the highest AUC of 0.95. Similarly, in the present study, combining the radiomics model (AUC of 0.96) with the radiological model improved the discriminative performance to an AUC of 0.97. This indicates that radiomics features and conventional MRI features provide complementary information for differentiating phyllodes tumor from fibroadenoma. To build a robust prediction model, the visual assessment by radiologists should be combined with the radiomics features for high diagnostic ability.

Our study has a number of unique characteristics compared to the previous study.

First, the dataset in our patient cohort was imbalanced regarding the distribution of the two classes, comprising 32 patients with phyllodes tumors and 56 patients with fibroadenomas; this was due to the low prevalence of phyllodes tumor [43]. The small patient population might have led to a risk of overfitting in the machine learning-based classifications. We performed data augmentation in order to achieve better class balance and to avoid model overfitting before further evaluation.

Second, ComBat harmonization was used to remove the possible effects caused by different scanners and different magnetic field strengths. MR images were acquired from three different scanners with different magnetic field strengths, which can affect the extracted features. The ComBat harmonization method has been used in several previous MR radiomic studies, which confirmed that it is effective for MR images and that it does not alter the discriminant information conveyed by the features [36, 37, 44, 45].

Third, generally, a high number of radiomics features are extracted from images in radiomics analysis. Since some features are redundant and unstable, it is crucial to select significant and stable features in machine-learning-based radiomics analysis. Thus, LASSO regression was employed to select robust features and remove redundant features, achieving robust classification performance [37, 38, 46]. In this study, among 1070 features, 8 histogram features, 8 co-occurrence matrix features, 5 run-length matrix features, 6 autoregressive model features, and 8 wavelet transform features were selected by LASSO logistic regression. It can be seen that histogram features, co-occurrence matrix features, and wavelet transform features play important roles in discriminating phyllodes tumors from fibroadenoma.

Finally, Mai et al. applied only one machine learning classifier (k-nearest neighbor) [47] to differentiate phyllodes tumor from fibroadenoma [22]. However, when less data is available, several experiments with various machine learning classifiers may be needed to find the best machine learning scheme. A previous study suggested that the diagnostic performance of radiomics analysis is highly dependent on the choice of the machine learning classifier [24]. In this study, we applied three machine learning classifiers, SVM [40], XGB [41], and RF [42], and obtained an AUC range of 0.91–0.96 in the validation set. Our study found that the radiomics model with an SVM classifier had the highest AUC value among the three machine learning classifiers.

Our study has several limitations. First, any retrospective single-center study may have a selection bias. Therefore, future studies should have a large sample size, multivendor images, and an external test set [48]. Second, our sample size was small, with a tumor class imbalance. Unfortunately, the low occurrence rate of phyllodes tumors means that a large sample size will be hard to achieve [43]. Small sample size and class imbalance can cause overfitting in machine learning classifications; we tried to address this by applying data augmentation, which has been shown to be successful [27, 29, 30]. Third, the 2D ROIs were manually drawn. Although 3D volumetric analysis has shown better performance than 2D analysis [49], we think that 3D VOI would not be clinically practical because of the excessive segmentation time. 2D ROIs are easier to calculate and less time-consuming [50]. Fourth, MR image data were collected from three different scanners with different magnetic field strengths, which can affect the extracted features [26, 27]. In the present study, a ComBat harmonization method was used to remove scanner-specific effects from features.

Conclusions

In conclusion, our study demonstrates that an MRI-based radiomics model can differentiate phyllodes tumor from fibroadenoma accurately and robustly and can serve as a valuable clinical tool for the clinical decision-making process.