Introduction

Triple-negative breast cancer (TNBC) is one of the molecular subtypes with the worst prognosis among all breast cancers [1, 2]. It is characterized by the absence of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) expression. TNBC is associated with high heterogeneity, aggressive proliferation, and low differentiation [2,3,4]. Because of these biomolecular characteristics, treating TNBC remains challenging: no effective drugs are available against well-defined molecular targets. The high risk of recurrence and distant metastasis results in a poor prognosis among TNBC patients [1,2,3,4,5,6].

Ultrasonography is an important imaging tool for breast diseases. In our previous study, we addressed the heterogeneity of the sonographic features of TNBCs [7]. The most common sonographic features of TNBCs included regular shape, no angular/spiculated margin, posterior acoustic enhancement, and no calcifications, all of which are characteristic of benign breast masses [7]. These benign-like TNBCs are associated with a high cellular proliferation rate and poor differentiation [7], which indicates a high risk of recurrence and distant metastasis.

Unfortunately, these benign-like TNBC masses are easily missed, especially in young patients [8]. Early and accurate recognition of this kind of biologically aggressive breast tumor would therefore help improve clinical outcomes. Thus, benign-like breast masses require more attention during ultrasound examinations, particularly from less experienced doctors. This challenge for ultrasound physicians calls for more advanced methods to improve diagnostic performance.

In the last decade, the concept of radiomics has emerged and flourished. Computer-aided analysis converts imaging information into quantitative numerical data using a series of computational algorithms [9,10,11]. Numerous studies have confirmed that information obtained from medical images such as ultrasound and magnetic resonance imaging (MRI) is closely related to the characteristics of genes, proteins, and tumor phenotypes [9, 10, 12,13,14,15]. However, the assessment of ultrasound images is operator-dependent, with large intra- and inter-observer variability [16]. Computer-aided feature analysis is expected to reduce this variability among observers. Our preliminary results showed that high-throughput quantitative analysis of ultrasound images of breast cancers was reliable [17] and could be used to predict the biological behavior of breast cancers [18] and TNBCs [19].

In the present study, we compared the performance of quantitative high-throughput sonographic feature analysis with that of qualitative feature assessment in predicting the pathological and immunohistochemical (IHC) characteristics of TNBCs.

Materials and methods

We reviewed the clinical data, including preoperative ultrasound reports and images, surgical records, and postoperative pathological and IHC results, of 6758 patients who underwent breast cancer surgery at our center from June 2014 to June 2019. Patients who presented with a solitary mass on preoperative ultrasound images and were confirmed as having TNBC by postoperative pathology were eligible for the study. The exclusion criteria were bilateral or multiple masses, recurrence, previous breast cancer surgery, neoadjuvant chemotherapy, mass diameter larger than 5 cm, and ultrasound images of poor quality. Finally, 252 eligible TNBC patients were included. The study was approved by the institutional review board of Fudan University Shanghai Cancer Center. Informed consent was waived because the data were collected retrospectively.

Ultrasound equipment and preoperative image acquisition

The ultrasound equipment used in the study included Aixplorer (SuperSonic Imagine), LOGIQ E9 (GE Healthcare), xMATRIX and iU22 (Philips Medical Systems), Aplio 500 (Canon Medical Systems), and MyLab90 and MyLabTwice (Esaote). A high-frequency (5–14 MHz) linear array transducer was used to scan the breast masses. Ultrasound images were stored in the Digital Imaging and Communications in Medicine (DICOM) format.

Qualitative sonographic feature assessment

All ultrasound images were reviewed by two experienced ultrasound (US) physicians who were blinded to the patients’ clinical characteristics and histological results. The sonographic features of TNBC masses were assessed according to the Breast Imaging Reporting and Data System (BI-RADS) [20]. In the present study, TNBCs were assessed in terms of tumor shape (regular or irregular), angular or spiculated margin (yes or no), posterior acoustic pattern (shadow, enhancement, no change, or mixed change), and calcification (yes or no). Disagreements between the two ultrasound physicians were resolved by discussion until a consensus was reached.

Image selection for radiomics analysis

For each patient, one typical ultrasound image that best matched the morphologic description of the breast mass was selected for computer-aided radiomics analysis. First, the region of interest (ROI) delineating the margins of the mass was outlined on the selected image by an ultrasound physician (ZJ Zhao). The data set was then randomly divided into a training set and a testing set at a ratio of 7:3. Figure 1 shows the flowchart of the computer-aided radiomics analysis.
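The random 7:3 division can be illustrated with a minimal sketch. The paper does not state the software, the random seed, or whether the split was stratified, so the function name `split_train_test`, the seed, and the integer patient identifiers below are all hypothetical.

```python
import random

def split_train_test(patient_ids, train_fraction=0.7, seed=0):
    """Randomly partition patient identifiers into training and testing sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers 0..251, matching only the cohort size of 252
train_ids, test_ids = split_train_test(range(252))
print(len(train_ids), len(test_ids))
```

Splitting by patient rather than by image keeps all information from one patient on a single side of the split; here the two coincide because one image was selected per patient.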

Fig. 1 Flowchart of the computer-aided radiomics analysis

Feature extraction from the US image

A total of 1688 high-throughput radiomics features based on the ROI were extracted to analyze the internal heterogeneity from 10 types of images for each ultrasound image. These image types include original, wavelet, Laplacian of Gaussian (LoG), square, square root, exponential, logarithm, gradient, and local binary pattern 2-dimensional (LBP2D). All features are grouped into the following seven feature classes: first order, shape, gray level co-occurrence matrix (GLCM), gray level size zone matrix (GLSZM), gray level run length matrix (GLRLM), neighboring gray tone difference matrix (NGTDM), and gray level dependence matrix (GLDM) (Table 1).
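To give a feel for what one texture feature class measures, the sketch below computes a gray level co-occurrence matrix (GLCM) and its contrast for two toy images. It is an illustration only: the paper does not name its extraction software, the toy pixel values are made up, and ROI masking and gray level discretization are deliberately left out.

```python
from collections import Counter

def glcm(image, dx=1, dy=0, symmetric=True):
    """Normalized co-occurrence matrix as a dict {(i, j): probability}."""
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[(image[r][c], image[r2][c2])] += 1
                if symmetric:
                    # Count the pair in both directions
                    counts[(image[r2][c2], image[r][c])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def glcm_contrast(p):
    """Contrast: sum over gray level pairs of (i - j)^2 * p(i, j)."""
    return sum((i - j) ** 2 * prob for (i, j), prob in p.items())

uniform = [[1, 1, 1], [1, 1, 1]]   # homogeneous toy texture
striped = [[0, 3, 0], [3, 0, 3]]   # alternating toy texture
print(glcm_contrast(glcm(uniform)))  # 0.0
print(glcm_contrast(glcm(striped)))  # 9.0
```

A homogeneous region yields zero contrast, whereas alternating gray levels yield a high value, which is the sense in which such features quantify internal heterogeneity.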

Table 1 Radiomics features of each feature class

Feature selection and classification

To remove redundant features and thereby reduce overfitting, machine learning was used to select the most robust radiomics features correlated with the biological features. In this study, a combination of the principal component analysis (PCA) [21] and least absolute shrinkage and selection operator (LASSO) [22] algorithms, termed the PCA + LASSO method, was used for feature selection.

The PCA algorithm considers all high-throughput features together and reduces the feature dimension according to the selected coefficients. The LASSO algorithm then extracts, from the reduced-dimensional features, those that correlate highly with the biological characteristics. This algorithm is suitable for both linear and nonlinear cases. Table 2 shows the number of radiomics features selected by the PCA + LASSO method for each biological property.
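The two-stage selection described above might look roughly like the following sketch, which uses NumPy only. It is not the authors' implementation: the synthetic data, the number of retained components (40), the penalty weight `lam`, the squared-error loss, and both function names are assumptions made for illustration, and the subsequent classifier step is omitted.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project standardized features onto the leading principal components."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # Rows of Vt are the principal axes, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    return Xs @ Vt[:n_components].T

def lasso_coordinate_descent(Z, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Zb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, k = Z.shape
    b = np.zeros(k)
    col_sq = (Z ** 2).sum(axis=0) / n
    yc = y - y.mean()  # columns of Z are centered, so the intercept is mean(y)
    for _ in range(n_iter):
        for j in range(k):
            # Partial residual with component j left out
            r = yc - Z @ b + Z[:, j] * b[j]
            rho = Z[:, j] @ r / n
            # Soft-thresholding sets small coefficients exactly to zero
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 300))               # hypothetical: 120 masses, 300 features
y = (X[:, :5].sum(axis=1) > 0).astype(float)  # hypothetical binary label
Z = pca_reduce(X, n_components=40)
coef = lasso_coordinate_descent(Z, y, lam=0.02)
selected = np.flatnonzero(coef)               # components with nonzero coefficients
print(f"{selected.size} of {Z.shape[1]} components kept")
```

In practice both stages would be fitted on the training set only and then applied unchanged to the testing set, so that the reported performance is not inflated by information from the test cases.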

Table 2 Number of radiomics features after dimension reduction for each biological property

After selection, the selected features were input to a support vector machine (SVM) classifier for classification. Three classification models were proposed and compared: (I) all features, (II) features selected with the PCA method, and (III) features selected with the combined PCA + LASSO method.

Postoperative pathology and IHC

The postoperative specimens were fixed in formalin, embedded in paraffin, and stained with hematoxylin–eosin (HE) for histological and IHC assessment. Before the histological specimens were prepared, the tumor size was determined on the gross sample. The pathological characteristics evaluated by HE staining included pathological type, nuclear grade, status of lymphovascular invasion, papilla invasion, and axillary lymph node metastasis. Based on the pathological type, patients were divided into infiltrating ductal carcinoma, infiltrating lobular carcinoma, and other types of invasive breast carcinoma. Based on nuclear grade, patients were divided into three groups: grade I (highly differentiated), grade II (moderately differentiated), and grade III (poorly differentiated).

IHC analysis was performed to determine the expression of ER, PR, HER2, and Ki67. Negative expression of ER and PR was defined as staining in less than 1% of nuclei. HER2 status was considered negative when the IHC score was 0 or 1+, or when HER2 amplification was absent (ratio < 2.2) on the fluorescence in situ hybridization (FISH) test. TNBC was defined as negativity for ER, PR, and HER2 in accordance with the St. Gallen International Expert Consensus [23]. The Ki67 expression level was based on the proportion of nuclei with positive staining.

Patients were dichotomized by each of the following characteristics: pathological grade, low (I and II) versus high (III); Ki67 level, < 40% versus ≥ 40%; HER2 score, low (0 or 1+) versus high (2+); axillary lymph node metastasis (ALNM), yes versus no; and lymphovascular invasion (LVI), yes versus no. The HER2 subgrouping was based on our previous finding that a higher HER2 score (2+ with negative FISH) was associated with a higher chance of calcifications in TNBCs [7]. A cutoff of 40% was used for Ki67 in this TNBC cohort because, in our previous experience, the median Ki67 expression level was 60–70% and the mean was about 60%. The guideline cutoffs of 20% or 14% would introduce bias, because most TNBC cases would then be classified as having high Ki67 expression. Therefore, in our TNBC-related articles, a cutoff of 40% was used to define the Ki67 expression level [7, 24, 25].

Statistical analysis

The statistical analyses of qualitative sonographic features were performed using SPSS for Windows version 22.0 (IBM Corp.). Continuous data were presented as mean ± standard deviation (SD) with range, or as median with interquartile range (IQR), after normality was tested with the Kolmogorov–Smirnov test. Categorical data were presented as frequency (percentage, 95% confidence interval [CI]). Multivariate logistic regression analysis was used to identify independent qualitative sonographic features that correlated with the pathological characteristics of TNBCs. The odds ratio (OR) was calculated for the qualitative sonographic feature with the best predictive value.

Machine learning was used to analyze the high-throughput features. The PCA algorithm was used for dimension reduction: it projected the feature space containing all 1688 features onto a smaller subspace while ensuring that the loss of feature information had little overall influence on the original information. The LASSO algorithm then selected contributory variables from the reduced-dimension features by adding a penalty term, which compresses small coefficients to 0 so that insignificant variables are discarded. Figure 2 shows the coefficients of the selected features in the LASSO model.
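For reference, the penalty mentioned above is usually written as follows. This is the textbook squared-error form of LASSO; the paper does not state the exact loss or penalty weight used, so the symbols below (reduced-dimension feature vector $z_i$, label $y_i$, intercept $\beta_0$, coefficients $\beta_j$, penalty weight $\lambda$, sample size $n$, and number of reduced features $k$) are assumptions introduced for illustration.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - z_i^{\top}\beta \right)^2
  + \lambda \sum_{j=1}^{k} \lvert \beta_j \rvert
```

Because the penalty is the sum of absolute values rather than squares, increasing $\lambda$ drives a growing number of coefficients to exactly 0, and the features with nonzero coefficients are the ones retained.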

Fig. 2 The confusion matrix of high-throughput radiomics features analysis. Column: the predicted category; Row: the true category of the data. a Histological grade; b Ki67; c HER2; d ALNM; e LVI

The predictive efficiency of quantitative and qualitative sonographic features for pathological and IHC data was evaluated by sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) [26]. A two-tailed p value of less than 0.05 was considered statistically significant.
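The three evaluation measures can be computed as in the sketch below, which uses plain Python and made-up labels and scores. The 0.5 decision threshold, the toy data, and the function names are assumptions; the AUC is computed through its rank interpretation, that is, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1/0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: three positive and three negative cases
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
print(round(sens, 3), round(spec, 3), round(auc(y_true, scores), 3))  # 0.667 0.667 0.889
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality over all thresholds.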

Results

Table 3 shows the demographics, surgical information, postoperative pathology, and immunohistochemistry of the 252 patients. Nine patients were excluded from the histological grade–related analysis because of missing data. The 252 TNBC patients had a mean age of 50.9 years (SD 11.7, range 22–82) and a mean tumor size of 2.3 cm (SD 0.8, range 0.8–4.9). Most patients had infiltrating ductal carcinoma (96.0%) with high histological grade (84.0%) and a high Ki67 level (median 70%), but without axillary lymph node metastasis (68.7%) or lymphovascular invasion (67.5%).

Table 3 Demographic, surgical, and pathological features of 252 patients. Data are presented as mean ± SD (range) for age and tumor size, median (IQR) for Ki67 expression level, and frequency (%, 95% confidence interval, CI) for categorical data

The combination of four sonographic features (regular tumor shape, no angular or spiculated margin, posterior acoustic enhancement, and no calcification) was used to predict the clinical, histological, and IHC characteristics of TNBCs, as shown in Table 4. The AUC was 0.673 (p = 0.001) for histological grade, 0.680 (p < 0.0005) for Ki67 level, 0.651 (p = 0.01) for HER2, 0.587 (p = 0.026) for ALNM, and 0.566 (p = 0.088) for LVI. Posterior acoustic enhancement was the predominant sonographic feature associated with high histological grade (OR = 3.81). No angular/spiculated margin was the predominant sonographic feature associated with high Ki67 expression (OR = 2.31). Calcification was predominantly associated with high HER2 score (OR = 2.48). No independent sonographic feature was associated with axillary lymph node metastasis.

Table 4 The efficacy of using two-dimensional ultrasound features (regular shape, no angular/spiculated margin, posterior acoustic enhancement, and no calcification) to correlate with pathological and IHC characteristics of TNBCs

The performance of radiomics analysis in predicting each biological property is shown in Table 5. Compared with PCA alone, the combination of the PCA and LASSO algorithms increased the AUC for each biological property by 46.0–88.4%. Compared with directly using all 1688 features for classification, the PCA + LASSO method increased the AUC for predicting biological features by 30.0–116.7%. With the selected features, the AUC was 0.942 for histological grade (34 features), 0.732 for Ki67 (27 features), 0.730 for HER2 (25 features), 0.804 for ALNM (34 features), and 0.795 for LVI (34 features). Figure 3 shows the confusion matrix reflecting the classification accuracy for each biological property. In a confusion matrix, higher values on the main diagonal indicate that the predicted category more often matched the actual category. In Fig. 3, the off-diagonal values are close to 0, which means that the model rarely misclassified cases. For example, for histological grade, the off-diagonal values are 1 and 2, much smaller than the values of 11 and 59 on the main diagonal, indicating good model performance.
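As a worked example of reading the confusion matrix, the overall accuracy for histological grade can be derived from the four counts quoted above. The counts 11, 59, 1, and 2 come from the text; which off-diagonal cell holds 1 and which holds 2 is an assumption, although the accuracy does not depend on that choice.

```python
def accuracy(confusion):
    """Overall accuracy: correctly classified cases (main diagonal) over all cases."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Rows: true category; columns: predicted category.
# Diagonal counts (11, 59) and off-diagonal counts (1, 2) are quoted in the text;
# the placement of 1 and 2 in the off-diagonal cells is assumed.
grade_confusion = [[11, 1],
                   [2, 59]]

print(round(accuracy(grade_confusion), 3))  # 0.959
```

That is, 70 of the 73 test cases were classified correctly for histological grade.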

Table 5 The efficacy of using high-throughput sonographic features to correlate with pathological and IHC characteristics of TNBCs
Fig. 3 The coefficients of selected features in the LASSO model. The ordinate lists the serial numbers of certain features, and the corresponding color bars representing the coefficients of selected features show the significance of the corresponding features. a Histological grade; b Ki67; c HER2; d ALNM; e LVI

Discussion

TNBC is recognized as an aggressive disease and has been studied extensively in terms of imaging features, clinical outcome assessment, and therapeutic target exploration [1, 7, 27,28,29,30]. Because of the high heterogeneity of its biological properties at the cellular and genetic levels [28, 29, 31,32,33,34], the clinical outcome of TNBC is highly diverse among patients. Similarly, the imaging appearance of TNBC varies greatly in accordance with its biological and clinical characteristics [7, 27, 35, 36], as illustrated in Figs. 4 and 5. The heterogeneity of sonographic features hinders the early and accurate diagnosis of TNBCs, especially those with a benign-like sonographic appearance. It has been reported that some TNBCs are prone to be confused with fibroadenomas [37]. These TNBCs with benign sonographic features tend to show more proliferative and aggressive biological properties, such as high histological grade and high Ki67 expression level [7]. In the present study, we used quantitative high-throughput feature analysis to further validate the association between sonographic features and biological properties. Our results showed that both qualitative and quantitative sonographic features of TNBCs are associated with tumor biological characteristics, and that quantitative high-throughput feature analysis is superior to two-dimensional sonographic feature assessment in predicting tumor biological properties.

Fig. 4 TNBC with regular tumor shape, circumscribed margin, and posterior acoustic enhancement at sonography in a 27-year-old female patient (BI-RADS: 4A). a Gray-scale US image; b Color Doppler US image; c HE staining showing the nuclear pleomorphism and nuclear mitosis (original magnification × 400, histological grade III); d IHC staining for Ki67 quantification (original magnification × 200, Ki67 80%)

Fig. 5 TNBC with irregular tumor shape and angular, spiculated margin at sonography in a 48-year-old female patient (BI-RADS: 4C). a Gray-scale US image; b Color Doppler US image; c HE staining showing the sporadic nuclear mitosis (original magnification × 400, histological grade II); d IHC staining for Ki67 quantification (original magnification × 200, Ki67 30%)

Most TNBCs have a higher histological grade and a higher cellular proliferation rate than other molecular subtypes of breast cancer. The active and rapid growth of TNBCs results in less matrix reaction, which leads to clear boundaries between tumors and surrounding normal tissues [38, 39]. Some TNBCs show a less active growth pattern, which allows enough time for interaction with host cells and leads to fibrosis, inflammation, and neovascularization [38, 39]. These interactions produce angular, spiculated projections on the tumor margins, which are the areas where normal breast tissue and tumor tissue grow into each other. These differences in growth characteristics lead to variability in the sonographic features of TNBC [7].

This is the first study to compare the performance of quantitative and qualitative sonographic feature assessment for TNBCs. The AUC values for quantitative features were higher than those for qualitative sonographic features. Traditional medical imaging techniques, including ultrasonography, are based on general anatomical and morphological images of tissues, organs, or lesions. The images are interpreted subjectively, while the tumor biology-related information hidden in them is not well considered [9]. In addition to the conventional descriptive imaging signs visible to the naked eye, medical images also contain digitized information [9, 10, 40]. High-throughput image analysis is the core of radiomics and artificial intelligence. By digitizing the information hidden in the image, high-throughput image analysis can associate the imaging phenotype with tumor biological characteristics [9, 10]. Relevant studies have also confirmed that the information obtained from medical imaging is closely related to the characteristics of genes, proteins, and tumor phenotypes [9, 10]. In the past decade, numerous studies on radiomics and radiogenomics have emerged and have shown that radiomics can assist clinical decision-making in many ways. Radiomics analysis of MRI has been shown to predict the histological outcome of breast cancers [41] and the therapeutic effect of neoadjuvant chemotherapy [42]. An automated breast ultrasound diagnostic module, named S-detection technology, based on the sonographic appearances of breast masses has been launched and applied with promising clinical results [43]. Our previous studies also demonstrated that high-throughput sonographic feature analysis was reliable and reproducible for breast cancers and could assist in predicting the expression of hormone receptors in invasive breast cancers [18]. A recently published study found that a radiomics method can provide high diagnostic performance in differentiating fibroadenoma from TNBC [37].

Nowadays, the individualized treatment of TNBCs mainly focuses on the biomolecular characteristics detected by genomics and proteomics [28, 30,31,32,33, 44, 45]. TNBCs can be divided into 4–6 subgroups based on cytokeratins [33], transcriptomes [32, 45], or genomics [28, 31], and these subgroups are associated with clinical outcomes. An infiltrative tumor border pattern was more associated with the luminal cluster, whereas a pushing border pattern was more associated with the basal cluster, which showed a better clinical outcome than the luminal group [33]. This finding might indicate that malignant-like TNBCs have a poorer prognosis than those with benign-like appearances. Similarly, Wang et al. found that vertical orientation was associated with worse outcomes and a greater chance of axillary lymph node metastasis [46]. However, these results conflict with our finding that the basal-like and immune-suppressed (BLIS) subgroup, which is more likely to show benign-like sonographic features, tended to have a poorer prognosis than other molecular subgroups such as immunomodulatory (IM), luminal androgen receptor (LAR), and mesenchymal-like (MES) TNBCs [25]. Moreover, the routine application of proteomics and genomics in clinical practice remains challenging because of the complicated process and high cost. Imaging features are always the first information acquired, before treatment and before genomic and proteomic data become available. If radiomics, proteomics, and genomics are associated, imaging characteristics can assist in predicting the biological properties of TNBCs, which would be valuable for treatment decision-making and prognosis prediction. This project has been initiated through collaborations with the Department of Breast Surgery at our cancer center.

Our results should be interpreted in light of several limitations. First, the images used for quantitative and qualitative assessment were retrospectively retrieved from the image archives. Still images may not fully depict the sonographic features of breast masses. This aspect is very important in traditional ultrasound and should be investigated. In future work, an automated breast ultrasonography (ABUS) system may allow the whole breast to be explored automatically and the entire examination to be stored [47]. Second, tumors larger than 5 cm and non-invasive breast carcinomas were not included, so our results do not apply to them. Third, during feature dimensionality reduction with PCA, the newly generated features were combinations of the original dimensions that summarized the information contained in the original feature space, which made the generated features difficult to interpret. Last, the AUCs for the quantitative and qualitative methods were difficult to compare because they were obtained with different algorithms.

Conclusion

High-throughput sonographic feature analysis using the combination of PCA and LASSO is superior to two-dimensional feature assessment in terms of predicting tumor biological characteristics and clinical behavior of TNBCs.