Introduction

Breast cancer is the most frequently diagnosed cancer in women. Metastatic breast cancer develops in approximately 30%–40% of patients with invasive breast cancer [1, 2] and accounts for approximately 15% of cancer deaths in women [3]. Morbidity and mortality tend to be higher in patients diagnosed at a younger age. Studies show a close correlation between a large precancerous space in breast cancer patients and tumor size, histological grade, lymphatic invasion, lymph node metastasis, and prognosis, suggesting that it could be used as an important marker to predict prognosis [4, 5]. Exploring the risk factors related to postoperative recurrence of breast cancer, constructing a predictive model to help identify patients at high risk of recurrence after surgery, and standardizing timely treatment for patients with recurrence will be important for improving the prognosis and quality of life of these patients.

Machine learning is now being applied to medical decision-making [6, 7]. Machine-learning processes range from fully manual to fully automated procedures that can process medical data without manual intervention [8]. For example, a machine-learning model that used 17 variables to estimate the risk of in-hospital mortality for patients in the intensive care unit achieved an accuracy of 94% [7]. Deep learning has also been applied in plastic surgery [9]. However, there are no reported studies on machine learning and cancer-specific survival after invasive breast cancer surgery. The aim of this study is to predict cancer-specific mortality after surgical resection of primary non-metastatic invasive breast cancer using a machine-learning algorithm.

Methods

Study population

This study is a secondary analysis of data from 1661 women with primary non-metastatic invasive breast cancer. The data analyzed are available in the BioStudies database [10] (accession number: S-EPMC4658156).

Study variables

Clinical pathology factors, such as axillary lymph node status, menopausal status, age, pathological diagnosis, tumor size, histological grade, hormone receptor and HER-2 status, date of last follow-up, and death related to cancer, were collected. The laboratory data in this study included globulin, total bilirubin (TB), albumin, lactate dehydrogenase (LDH), uric acid, cholesterol, fibrinogen, and triglycerides.

Machine learning

The decision tree (tr) is a supervised learning algorithm that represents its decision rules as a binary or multiway tree.

Logistic regression (lr) is a simple classification algorithm that classifies data by fitting a decision boundary between the classes. Here, “regression” refers to finding the “best fit”, that is, the best-fitting parameters. Optimization methods are used to estimate the best regression coefficients.
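As an illustration of how an optimization method can estimate the regression coefficients, the following is a minimal sketch of logistic regression fitted by batch gradient descent on the log loss. It is not the code used in this study (which is in Supplementary material 1); the function names, the learning rate, the number of epochs, and the toy data are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    # Predicted probability of the positive class for one sample.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Estimate weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of the log loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - learning_rate * gj / n for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b

# Hypothetical toy data: one standardized feature, label 1 = event.
X_toy = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
w_toy, b_toy = fit_logistic(X_toy, y_toy)
```

The fitted coefficient is positive on this toy data, so larger feature values map to a higher predicted probability of the event.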

Random Forest (RF) is a classification algorithm that builds a large number of decision trees during training. At prediction time, it outputs the class that is the mode of the classes predicted by the individual trees.
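The voting step can be sketched in a few lines. This is only an illustration of majority voting over per-tree predictions; the helper name and the five votes are hypothetical, and tree construction itself is omitted.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class among the per-tree predictions for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical outputs of five trees for a single patient (1 = cancer-specific death).
votes = [1, 0, 0, 1, 0]
forest_prediction = majority_vote(votes)
```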

LightGBM (gbm) is a recent addition to the family of boosting ensemble models. Its main optimizations are the “Histogram Algorithm”, the “Growth Strategy Optimization of the Tree”, and “High Efficiency”.

Gradient Boosting Decision Tree Classification (Gbdt) is an ensemble learning method. It adopts an additive model (a linear combination of basis functions) and a forward stagewise algorithm. A series of weak learners is obtained iteratively from a weak learning algorithm, and a strong learner is then obtained by combining these weak learners. When each weak learner is a CART tree, the model is a boosting tree. Squared error is generally used as the loss function for regression problems, exponential loss is used for classification problems, and a general loss function is used for general problems.
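The additive, forward stagewise idea can be illustrated with a minimal sketch that uses squared-error loss and one-split regression stumps as the weak learners. This is a simplified regression-style example under stated assumptions, not the classifier used in this study; the function names, the number of rounds, the learning rate, and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump by least squares (needs >= 2 distinct x values)."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean

def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Forward stagewise additive model: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)  # initial constant model
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda v: base + learning_rate * sum(s(v) for s in stumps)

def mean_squared_error(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

# Hypothetical toy data with a step between x = 4 and x = 5.
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_toy = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
mean_y = sum(y_toy) / len(y_toy)
baseline_mse = mean_squared_error(lambda v: mean_y, x_toy, y_toy)
boosted = fit_boosting(x_toy, y_toy)
boosted_mse = mean_squared_error(boosted, x_toy, y_toy)
```

Each round adds a small correction to the running prediction, so the combined model fits the toy data much more closely than the constant baseline.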

Statistical analysis

The R language was used for general statistical calculation. The primary endpoint was the cancer-specific survival rate (CSS), calculated from the date of diagnosis to the date of cancer-related death or last follow-up. Means and standard deviations were calculated, and differences were tested using the t-test. Differences between categories were assessed using the Chi-square test. The Python language was used for machine-learning modeling, and Pearson’s correlation coefficient was used for correlation analysis. The following machine-learning models were used: logistic regression, decision tree, random forest, LightGBM, and GBDT. We assigned 80% of the overall data to the training group for model development and used the remaining 20% as the test group for validation. Missing values were handled by multiple imputation. Model discrimination was assessed by the area under the receiver operating characteristic curve (AUC), which ranges from 0 to 1, with higher values indicating better performance. The related code is shown in Supplementary material 1.
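The 80/20 split, the AUC, and Pearson’s correlation coefficient can be sketched as follows. This is an illustrative stand-alone version rather than the study code in Supplementary material 1. The shuffled (random) split and the seed are assumptions, since the text does not state how the groups were drawn, and all function names and toy values are hypothetical.

```python
import math
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them (assumed random split)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Placeholder patient indices standing in for the 1661 records.
train_rows, test_rows = train_test_split(list(range(1661)))

# Hypothetical labels (1 = cancer-specific death) and predicted risks.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

With 1661 records, this split yields 1329 training rows and 332 test rows.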

Results

Patients’ characteristics

Table 1 compares the clinical characteristics of the surviving patients (surviving group) and the patients who died (deceased group). The age difference between the surviving group and the deceased group was significant (p = 0.046). The tumor size was significantly larger in the deceased group (p < 0.001).

Table 1 Clinical characteristics of the patients

Variable importance and correlation analysis

The Gbdt algorithm ranked the top five factors for cancer-specific death, from most to least important, as follows: number of regional lymph node metastases, LDH, triglycerides, plasma fibrinogen, and cholesterol (Fig. 1). The correlation analysis showed that these five factors were positively correlated with poor prognosis after breast cancer surgery (Fig. 2).

Fig. 1 Variable importance of features included in the machine-learning algorithm for the prediction of postoperative cancer-specific outcomes of patients after breast cancer surgery. Note: SD standard deviation, IDC invasive ductal carcinoma, ILC invasive lobular carcinoma, ER estrogen receptor, PR progesterone receptor, HER-2 human epidermal growth factor receptor-2, LDH lactate dehydrogenase, TB total bilirubin

Fig. 2 Correlation among variables

Machine-learning algorithm models for breast cancer-specific mortality in the training group

Table 2 summarizes the performance characteristics of the models in the training group. Among the five algorithm models, accuracy was highest for Forest (0.877), followed by the Gradient Boosting algorithm (0.863). The AUC values were 0.973 for Forest, 0.835 for Gradient Boosting, 0.804 for gbm, 0.747 for Logistic, and 0.726 for DecisionTree. Forest also had the highest precision (0.987), with a recall of 0.316 (Fig. 3).
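Accuracy, precision, and recall follow the usual confusion-matrix definitions. The sketch below uses hypothetical counts (not the study data) to show how a model can combine high precision with low recall when deaths are the minority class; the function name and the zero-division convention are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = cancer-specific death)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Convention assumed here: report 0.0 when the denominator is zero.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical cohort: 20 deaths among 100 patients; the model flags 7 patients, 6 correctly.
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 6 + [0] * 14 + [1] * 1 + [0] * 79
metrics = classification_metrics(y_true, y_pred)
# accuracy = 85/100 = 0.85, precision = 6/7 (about 0.857), recall = 6/20 = 0.30
```

In this toy cohort, most flagged patients are true deaths, yet most deaths are missed, which is the same pattern as a high precision paired with a low recall.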

Table 2 Forecast of results for the training group

Fig. 3 Different machine-learning algorithms predict the postoperative cancer-specific outcomes of patients after breast cancer surgery in the training group

Machine-learning algorithm models for breast cancer-specific mortality in the test group

Table 3 summarizes the performance characteristics of the models in the test group. Among the five algorithm models, DecisionTree had the highest accuracy (0.841), followed by the gbm algorithm (0.838). The AUC values were 0.755 for GradientBoosting, 0.755 for gbm, 0.733 for Logistic, 0.715 for Forest, and 0.677 for DecisionTree. DecisionTree had the highest precision (0.667) and gbm had the highest recall (0.220) (Fig. 4).

Table 3 Forecast of results for the testing group

Fig. 4 Different machine-learning algorithms predict the postoperative cancer-specific outcomes of patients after breast cancer surgery in the test group

Discussion

Breast cancer is a global issue. According to statistics, about 180,000 cases of breast cancer and 25,000 cases of non-malignant breast tumors are diagnosed each year in the USA [11]. Although comprehensive treatment combining surgery, endocrine therapy, and targeted therapy has improved the overall survival of patients with breast cancer, most patients are still at risk of recurrence and metastasis after surgery. Early postoperative diagnosis and treatment remain key factors in achieving longer survival. The findings of the present study suggest that machine learning is helpful for predicting cancer-specific mortality in patients with surgically resected invasive breast cancer. Furthermore, the results of the Gbdt algorithm for cancer-specific mortality showed that the top five important factors were, in order, the number of regional lymph node metastases, LDH, triglycerides, plasma fibrinogen, and cholesterol.

Axillary lymph node status is an independent predictor of the recurrence and prognosis of breast cancer. It also has an important influence on diagnosis and treatment decisions, including the operative method and adjuvant therapy, as well as on the psychological status of patients [12]. Previous studies have shown that the 5-year survival rate of breast cancer patients without axillary lymph node metastasis exceeds 82%, decreasing to 73% with 1–3 lymph node metastases and to 46% with 4–12 lymph node metastases [13]. Moreover, about 75% of patients without lymph node metastasis have no relapse within 10 years, whereas 65% of those with 1–3 positive lymph nodes and 85% of those with 4 or more positive lymph nodes suffer relapse within 10 years of local treatment [14, 15]. Our research also shows that lymph node metastasis is one of the most important factors for cancer-specific death after breast cancer surgery.

Elevated serum lipid levels may increase the risk of breast cancer and are an important indicator of tumor prognosis. Studies have suggested that the development of breast cancer is closely related to the serum concentrations of total cholesterol (TC) and triglyceride (TG) [16], and that its onset may be associated with hypercholesterolemia [17]. Moreover, high TC, high TG, and hypertension are associated with the invasiveness of breast cancer, and these breast cancer patients have higher histologic grades [18]. Studies have also shown that TC and LDL-C levels are significantly higher in breast cancer patients than in HR+ patients, and that TNBC is often accompanied by low-density lipoprotein cholesterol levels [19]. Hux et al. [20] explained the mechanism as follows: during breast cancer proliferation, TG metabolism in breast cancer tissue is significantly more active than in the surrounding tissues, which increases the blood TG level and in turn decreases the testosterone-estradiol-binding globulin level. This increases the free estradiol concentration in the body, stimulating abnormal proliferation of the mammary epithelium and thus promoting the occurrence and progression of cancer [20]. This explanation is supported by our findings. High LDH levels are also an important risk factor for the survival and prognosis of breast cancer patients [21]. Martina et al. conducted a similar survival analysis of breast cancer patients with liver metastasis and reported that the overall survival of patients with a high lactate dehydrogenase level was only up to about 10 months, whereas that of those with a normal lactate dehydrogenase level was up to 60 months [22]. A study with more than 10 years of follow-up also showed that the lactate dehydrogenase level was an independent factor for the survival of patients with recurrent breast cancer [23]. Again, these results were similar to those of this study.

Fibrinogen has been associated with distant metastasis of malignant tumors [24]. Altiay et al. identified a trend of higher fibrinogen levels in patients with stage IV lung cancer than in those with stage IIIa/b lung cancer, but the difference did not reach significance [25]. Takuchi et al. also found a strong positive correlation between preoperative fibrinogen levels and the depth of tumor invasion in patients with primary esophageal cancer [26]. Matsuda et al. studied the relationship between changes in fibrinogen levels before and after neoadjuvant therapy for esophageal cancer and prognosis, and found that patients with elevated serum fibrinogen levels after neoadjuvant therapy had a worse prognosis [27]. The results of this study also suggest that plasma fibrinogen is an important risk factor for the prognosis of breast cancer patients.

Albumin, globulin, total bilirubin, and uric acid are all correlated with cancer prognosis. Studies have shown that low serum albumin levels are correlated with poor breast cancer prognosis [28]. Additionally, an increased C-reactive protein to albumin ratio is an independent risk factor for the prognosis of resectable non-metastatic breast cancer [29], and serum protein can predict postoperative survival after breast cancer surgery [30]. There are reports that the albumin/globulin ratio can also predict the postoperative prognosis after surgery for small-cell lung cancer. Similarly, the preoperative albumin/globulin ratio can predict gastric cancer prognosis [31]. Serum albumin and total bilirubin levels are independent risk factors for non-metastatic breast cancer prognosis [32]. It has also been reported that serum uric acid is an independent risk factor for advanced gastric cancer prognosis [33, 34], and that high serum uric acid is an independent risk factor for breast cancer prognosis [35]. These reports are consistent with our results.

This study was limited by its retrospective nature. First, because the analysis was conducted retrospectively, the data were fixed and could not be changed, which may have prevented subgroup analysis in all dimensions. Second, the predictive machine-learning models may require incremental adjustment, because the period over which historical data remain useful for predicting future clinical presentations may shorten; therefore, more prospective studies are needed. Third, this study shows some over-fitting, although we tried to correct for this by combining regularization and bagging. Furthermore, as this study was a secondary retrospective analysis, a major defect is that it did not include information about adjuvant treatments and follow-up.

Finally, neoadjuvant therapy is an important component of breast cancer treatment. It involves giving preoperative systemic treatment to suitable breast cancer patients to reduce tumor size and stage. This can give patients with inoperable breast cancer the opportunity for surgical treatment. It can also allow for breast-conserving surgery in patients who otherwise would not be candidates [36]. Moreover, it can identify tumor drug sensitivity and guide follow-up adjuvant treatment [37]. The treatment plan during follow-up may also be related to breast cancer prognosis. At present, guidelines recommend that adjuvant radiotherapy be administered within 6 months after surgery; however, some studies suggest that the earlier the radiotherapy starts, the better the prognosis [38]. Conversely, Caponio et al. [39] analyzed the influence of radiotherapy timing on local recurrence and distant metastasis in 615 patients with early breast cancer and found no correlation between the timing of postoperative radiotherapy and the risk of local recurrence or distant metastasis. Zhang et al. [40] also confirmed that a delay in starting radiotherapy after mastectomy does not increase the risk of local recurrence, distant metastasis, or death.

Conclusion

The findings of our study suggest that machine learning can help predict cancer-specific death in patients undergoing surgical resection for invasive breast cancer. According to the Gbdt algorithm, the top five factors for cancer-specific death are, in order of importance: number of regional lymph node metastases, LDH, triglycerides, plasma fibrinogen, and cholesterol.