Introduction

Colorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer-related death globally, accounting for approximately 10% of all cancer cases and deaths [1]. In 2018, there were 1.9 million new cases of and 540,000 deaths from colorectal cancer worldwide, and the global burden is projected to increase by 60% by 2030, with new cases and deaths reaching 2.2 million and 1.1 million, respectively [2, 3]. In the United States, the 5-year relative survival rate for colorectal cancer can be as high as 90% when the disease is detected at an early stage but is only 14% when distant metastases are present [4]. Early screening and identification of colorectal cancer patients is therefore important for reducing mortality, prolonging survival, and improving patients’ quality of life. Colonoscopy is currently the gold standard for diagnosing colorectal cancer, but its invasiveness, complex bowel preparation, procedural risks, and high cost limit its use in large-scale population screening [5, 6]. Fecal testing and serum tumor marker testing are the noninvasive methods commonly used for CRC in the clinic [7, 8], and in practice people comply better with blood sample collection than with stool collection [9]. Serum tumor markers become abnormally elevated mainly when tumor-related genes are expressed or when the body responds to a tumor, so they are valuable for both diagnosis and prognosis and are widely used in clinical practice [10]. In recent years, with the rapid development of information technology, a growing number of researchers have applied artificial intelligence to disease diagnosis and prognosis prediction and achieved good results [11,12,13].
This study examines the diagnostic efficacy of four serum tumor markers commonly used in the clinical detection of colorectal cancer (CA19-9, CEA, AFP, and CA125), both individually and in combination. Building on this, it constructs a machine learning-based diagnostic model for CRC that integrates these tumor markers with readily obtainable clinical blood biochemistry indicators. By comparing this model with the traditional tumor marker approach, the study aims to identify a more effective diagnostic route for CRC, and thereby to support early screening, timely treatment, and more targeted management of CRC patients.

Materials and methods

Participants

Internal validation set: Case group: colorectal cancer patients who attended the Department of Anorectal and Gastroenterology of the First Affiliated Hospital of Dalian Medical University from January 2019 to December 2021 were collected retrospectively. Inclusion criteria: 1. diagnosis of primary colorectal cancer confirmed by pathology; 2. age between 18 and 85 years; 3. no previous surgery or radiotherapy for related diseases; 4. all four serum markers tested; 5. consent obtained from the patient and family members, with a signed informed consent form. Exclusion criteria: 1. pregnant or lactating women; 2. concurrent acute or critical illness or organ failure; 3. serious gaps in the blood count or serum marker data in the medical records. A total of 800 cases were finally included.

Control group: people who visited the health management center of the First Affiliated Hospital of Dalian Medical University during the same period were collected retrospectively. Inclusion criteria: 1. colonoscopy and serum marker testing performed; 2. age between 18 and 85 years; 3. consent obtained from the person and family members, with a signed informed consent form. Exclusion criteria: 1. colorectal cancer diagnosed by colonoscopy; 2. pregnant or lactating women; 3. serious gaps in the routine blood test or serum marker data in the medical records. A total of 697 controls were finally included.

External validation set: Case group: colorectal cancer patients who attended the anorectal department of the First Affiliated Hospital of Dalian Medical University in 2022 were collected retrospectively.

Control group: people who visited the health management center of the First Affiliated Hospital of Dalian Medical University during the same period were collected retrospectively. The inclusion and exclusion criteria of the external validation set were the same as those of the internal validation set. The study was approved by the Ethics Committee of the First Affiliated Hospital of Dalian Medical University.

Data collection

The study collected comprehensive data encompassing both demographic and health-related metrics. Demographic information included gender and age. Health-related data comprised a wide range of measures:

Basic health indicators: Body Mass Index (BMI), Systolic Blood Pressure (SBP), and Diastolic Blood Pressure (DBP).

Blood composition analysis:

Complete blood count: White Blood Cell (WBC), Neutrophil (NEUT), Lymphocyte (LYMPH), Red Blood Cell (RBC), Hemoglobin (Hb), Hematocrit (HCT).

Blood cell metrics: Red Cell Distribution Width—Standard Deviation (RDW-SD), Red Cell Distribution Width—Coefficient of Variation (RDW-CV), Platelet (PLT), Platelet Distribution Width (PDW), Mean Platelet Volume (MPV), Platelet Large Cell Ratio (P-LCR).

Kidney function tests: Creatinine (Cr) and Uric Acid (UA).

Blood glucose and liver function tests: Glucose (Glu), Alanine Aminotransferase (ALT), Aspartate Aminotransferase (AST), Albumin (ALB), Gamma-glutamyl Transferase (GGT), Total Bilirubin (TBIL), Direct Bilirubin (DBIL).

Cancer markers: Carbohydrate Antigen 19-9 (CA19-9), Carcinoembryonic Antigen (CEA), Alpha-fetoprotein (AFP), Cancer Antigen 125 (CA125).

This detailed collection of data aims to provide a holistic view of the participants’ health, facilitating a nuanced analysis of the relationship between these variables and colorectal cancer.

Sample collection and testing methods

Fasting venous blood (2–4 ml) was collected, and the serum was separated by centrifugation at 3000 r/min for 10 min and stored in a refrigerator at −20 °C until testing. Routine blood items were measured on a Sysmex XN-10 analyzer (Japan); blood glucose, liver function, and kidney function were measured on a Hitachi 7600-210 automatic biochemical analyzer; the four serum markers were measured by the electrochemiluminescence method on a Mindray CL-6000i instrument. Reference ranges included AFP (< 20 ng/ml) and CA125 (< 35 U/ml); a value beyond the reference range was considered positive.
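The positivity rule above can be illustrated with a short Python sketch. It is only an illustration, not the laboratory's procedure: the function names are hypothetical, and only the two reference limits stated in this section (AFP and CA125) are encoded, because the limits for CEA and CA19-9 are not given here.

```python
# Upper reference limits stated in the text (AFP in ng/ml, CA125 in U/ml).
# Limits for CEA and CA19-9 are not given in this section, so they are omitted.
UPPER_LIMITS = {"AFP": 20.0, "CA125": 35.0}


def is_positive(marker, value):
    """Return True when the value lies beyond the reference range (>= limit)."""
    return value >= UPPER_LIMITS[marker]


def flag_markers(sample):
    """Flag every marker in `sample` for which a reference limit is known."""
    return {m: is_positive(m, v) for m, v in sample.items() if m in UPPER_LIMITS}


# Hypothetical serum results for one person.
print(flag_markers({"AFP": 3.1, "CA125": 40.2, "CEA": 5.0}))
```

Markers without a known limit (CEA in the example) are skipped rather than guessed.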

Data cleaning

Outliers in the data were set to NA (missing), and BMI was converted into a categorical variable (< 18.5, 18.5–24.9, ≥ 25) according to WHO standards. Missing values were imputed with the “missForest” R package. missForest, proposed by Stekhoven in 2012, is an iterative imputation method based on random forests: it treats missing-value imputation as a prediction problem, can handle mixed data containing both categorical and continuous variables, and has been reported to outperform K-nearest-neighbor imputation, chained-equation imputation with the MICE package, and other imputation methods [14, 15]. After imputation, multicollinearity among the independent variables was assessed with the “performance” R package using the variance inflation factor (VIF), and variables with VIF < 5 were included in the study [16].
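To make the VIF criterion concrete, the following Python sketch computes the classical VIF for a set of continuous predictors as the diagonal of the inverse correlation matrix, which equals 1 / (1 − R²) for each predictor regressed on the others. This is an assumption-laden illustration: the study used the “performance” package in R, whose computation for fitted models may differ in detail, and the function names and data here are hypothetical.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long numeric columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    if any(norm == 0 for norm in norms):
        raise ValueError("a constant column has no defined correlation")
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Classical VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Two hypothetical predictors with correlation 0.8; both fall below the cutoff of 5.
values = vif([[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]])
print([round(v, 3) for v in values], all(v < 5 for v in values))
```

With only two predictors the VIF reduces to 1 / (1 − r²), which gives a quick sanity check for the routine.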

Model construction

In this study, several indicators, including the area under the ROC curve (AUC), accuracy, sensitivity, specificity, precision, and F1 score, were used to assess the efficacy of four serum tumor markers, CA19-9, CEA, AFP, and CA125, for diagnosing colorectal cancer alone and in combination. In general, the higher the AUC value, the better the discrimination of the model: when the AUC ≥ 0.9, the model performs excellently; when the AUC is between 0.8 and 0.9, the model performs well; when the AUC is between 0.7 and 0.8, the model performance is fair; and when the AUC < 0.7, the model performance is poor [17, 18]. The variables were then screened by backward stepwise regression, implemented with the stepAIC function in the “MASS” R package and based on the Akaike Information Criterion (AIC): all independent variables were first entered into the model, and the variables contributing least were then removed step by step until the AIC could not be lowered further, yielding the smallest set of independent variables with the lowest AIC value [19]. The screened variables were then used to construct the machine learning models. Six machine learning algorithms were selected: logistic regression (LR), support vector machine (SVM), gradient boosting machine (GBM), naive Bayes (NB), artificial neural network (ANN), and random forest (RF). The dataset was randomly divided into a training set and a test set at a ratio of 7:3, and tenfold cross-validation was used to validate the models internally. The predictive performance of the models was evaluated with the same indicators (AUC, accuracy, sensitivity, specificity, precision, and F1 score) and compared with the diagnostic performance of the tumor markers, and the better-performing models were then validated on the external validation set.
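The evaluation indicators named above can be written out explicitly. The Python sketch below is an illustration rather than the study's code (the analysis was done in R with “caret” and “pROC”); the function names and the toy labels and scores are hypothetical. The AUC is computed from its rank interpretation: the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision and F1 from binary labels (1 = CRC)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = precision + sensitivity
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / denominator if denominator else 0.0,
    }


def auc(y_true, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical labels, predicted probabilities, and predictions at a 0.5 cutoff.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(classification_metrics(y_true, y_pred), auc(y_true, scores))
```

Unlike the other five indicators, the AUC does not depend on the classification cutoff, which is why it is used here as the primary measure of discrimination.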
The machine learning models in this study were all built with the “caret” package in R 4.3.2, and the ROC curves were plotted with the “pROC” package.
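The 7:3 split and tenfold cross-validation described in this section can be sketched as index bookkeeping. The Python below is a simplified illustration with hypothetical function names and an arbitrary seed; it performs a plain random split, whereas the partitioning actually done through “caret” in R may, for example, be stratified by outcome.

```python
import random


def train_test_split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle the row indices 0..n-1 and cut them into training and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


# 800 cases + 697 controls = 1497 participants in the development data.
train_idx, test_idx = train_test_split_indices(800 + 697)
fold_sizes = [len(val) for _, val in k_fold_indices(train_idx)]
print(len(train_idx), len(test_idx), fold_sizes)
```

Each training-set row appears in exactly one validation fold, so every model is assessed on data it did not see during that fold's fitting, while the 30% test set stays untouched until the final comparison.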

Statistical analysis

R 4.3.2 was used to process and analyze the data. Continuous data were expressed as mean ± standard deviation (x̄ ± s) and compared between groups with the t test; categorical data were expressed as percentages (%) and compared between groups with the χ2 test. All statistical tests were two-sided, and differences were considered statistically significant at P < 0.05.
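For readers who want to see the two tests written out, the following Python sketch computes the Pearson χ2 statistic for a 2 × 2 table (with its P value, which for one degree of freedom has the closed form erfc(√(χ2/2))) and the pooled-variance two-sample t statistic. It is a sketch under stated assumptions: no continuity correction is applied, equal variances are assumed for the t statistic, and the function names and numbers are hypothetical rather than taken from the study.

```python
import math


def chi2_p_df1(statistic):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return statistic, chi2_p_df1(statistic)


def student_t(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((v - m1) ** 2 for v in x) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in y) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2)), n1 + n2 - 2


# Hypothetical counts (rows: two groups; columns: two categories) and measurements.
print(chi_square_2x2(10, 20, 30, 40))
print(student_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

The P value for the t statistic requires the t distribution, which the Python standard library does not provide, so only the statistic and its degrees of freedom are returned.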

Results

Baseline information

The flow chart for this study is shown in Fig. 1. As shown in Table 1, 800 CRC patients and 697 controls were included in this study. Among the demographic variables, age and BMI differed significantly between the two groups; gender did not differ significantly, although males outnumbered females in both groups. Three of the four serum markers, CA19-9, CEA, and CA125, were significantly associated with CRC, with mean values significantly higher in the case group than in the control group. All of the remaining biochemical markers were significantly associated with CRC except RDW-CV, PLT, PDW, Cr, AST, and GGT, which did not differ significantly between the two groups.

Fig. 1 The flowchart of this study

Table 1 Baseline characteristics of colorectal cancer patients compared with healthy controls

Test of multicollinearity

A multicollinearity test of the independent variables showed that all VIF values were less than 5 (Fig. 2), indicating no multicollinearity among the variables; all variables were therefore retained in the study and none needed to be excluded.

Fig. 2 Tests for multicollinearity between variables

Serum marker model

As shown in Fig. 3, when the four serum markers were used individually to predict CRC, only CEA (AUC = 0.79) reached an AUC of 0.7 or more, indicating fair performance; it was followed by CA19-9 (AUC = 0.643), whereas CA125 (AUC = 0.592) and AFP (AUC = 0.553) predicted CRC poorly. Combining the four serum markers predicted CRC better than any single marker, with an AUC above 0.8. As shown in Table 2, the AUC, accuracy, specificity, and precision of the combined model were all higher than those of the four single-marker models.

Fig. 3 ROC curves of four tumor markers alone and in combination to predict colorectal cancer

Table 2 Comparison of the efficacy of four tumor markers alone and in combination for the prediction of colorectal cancer

Machine learning models

Backward stepwise regression selected 17 variables (Table 3), which were incorporated into the six machine learning models. The ROC curves and model evaluation metrics for the training set are shown in Fig. 4A and Table 4, respectively. In the training set, apart from RF, which showed overfitting, all models reached AUC values above 0.8; three of them, GBM, SVM, and LR, had AUC values of 0.9 or more, indicating excellent predictive performance. GBM had the highest AUC in the training set, 0.945, and the best model performance; SVM was second, at 0.936; NB performed relatively poorly, at 0.865. The ROC curves and model evaluation metrics for the test set are shown in Fig. 4B and Table 5. In the test set, GBM and RF both had an AUC of 0.931, but the accuracy, sensitivity, and F1 score of GBM were higher than those of RF, so GBM was the best model for diagnosing CRC. RF had the highest specificity and precision of the six models, and its predictive performance was second only to that of GBM. The AUC values of ANN and NB still did not reach 0.9, and their predictive performance was relatively poor. The ROC curves and model evaluation metrics for the external validation set are shown in Fig. 4C and Table 6. In the external validation set, GBM still had the highest AUC of all the models; the AUC of RF was slightly lower than that of GBM, but its remaining evaluation metrics, such as accuracy and specificity, were the highest among all the models. In conclusion, after internal and external validation, GBM and RF showed the strongest diagnostic ability for CRC and a degree of generalizability.
Across the six machine learning models, CEA and ALB consistently ranked among the top five variables by importance. Age followed: it ranked first in importance in the five models other than ANN. DBIL and HCT also ranked highly in several models. These are all important variables for predicting CRC, as shown in Supplementary Fig. 1.

Table 3 Variables screened by applying stepwise regression (backward) method
Fig. 4 ROC curves of six machine learning models for predicting colorectal cancer. A Training set. B Test set. C External validation set; LR logistic regression, GBM gradient boosting machine, SVM support vector machine, NB naive Bayes, ANN artificial neural network, RF random forest

Table 4 Comparison of the efficacy of six machine learning algorithms in the training set for predicting colorectal cancer
Table 5 Comparison of the efficacy of six machine learning algorithms for predicting colorectal cancer in the test set
Table 6 Comparison of the efficacy of six machine learning algorithms for predicting colorectal cancer in the external validation set

Discussion

Colorectal cancer, one of the most common malignant tumors, is known for its high morbidity and mortality, which place a huge burden on patients and society [20, 21]. Early screening and diagnosis are of profound significance in reducing the morbidity and mortality of colorectal cancer. As a noninvasive, economical, and conveniently sampled test, the serum tumor marker assay is now commonly used in the clinic for the screening, diagnosis, and prognostic assessment of various types of tumors [22, 23]. However, an increasing number of studies have found that tumor markers have low sensitivity or specificity when used for cancer diagnosis [22, 24], so their suitability as an independent screening tool for malignant tumors remains open to question. Carcinoembryonic antigen (CEA), first demonstrated in human colorectal adenocarcinoma by Gold and Freedman in 1965, is the most widely used tumor marker for colorectal cancer [25]. ASCO recommends that CEA be used as an important factor in the postoperative surveillance and prognostic evaluation of colorectal cancer but not in the early screening or diagnosis of CRC, because its diagnostic sensitivity is low and it can produce an excessive number of false positives [26]. This is consistent with our findings: although CEA was the best-discriminating tumor marker for the independent diagnosis of CRC in this study, with an AUC value close to 0.8, its sensitivity was only 0.589, the lowest among the four tumor markers. In addition, elevated serum CEA is not specific to CRC: it also occurs in esophageal, gastric, and breast cancers [27] and in non-cancerous diseases such as hepatitis and pancreatitis [28]. Therefore, CEA is often used in conjunction with other tumor markers or as an adjunct to diagnosis rather than as an independent diagnostic tool.

In addition to carcinoembryonic antigen (CEA), commonly used tumor markers for colorectal cancer include the carbohydrate antigens CA19-9 and CA125 [29, 30]. Despite its low sensitivity for early identification and its inability to differentiate cancer effectively from benign diseases, CA19-9 is still the only tumor marker designated by the FDA for monitoring pancreatic ductal carcinoma in clinical practice [31], whereas CA125 is mainly used for screening and monitoring patients with ovarian cancer [32]; CA19-9 and CA125 also have some clinical value in the diagnosis and prognosis of CRC. Previous studies [33, 34] have demonstrated that CEA is superior to CA125 and CA19-9 in the diagnosis of CRC, and our study came to a similar conclusion: the AUC value, accuracy, and specificity of CEA were higher than those of the other tumor markers. Therefore, neither CA19-9 nor CA125 performs well when used alone for CRC diagnosis, and they are usually used as a complement to CEA or in combination with other tumor markers. Alpha-fetoprotein (AFP) is a glycoprotein produced by the fetal liver and yolk sac. In healthy adults, AFP levels are usually low. However, AFP is significantly elevated in the serum of patients with hepatocellular carcinoma, so in clinical practice it is mainly used as a tumor marker in the diagnosis and prognostic assessment of hepatocellular carcinoma. AFP has also been used to monitor other types of cancer, such as gastric and colorectal cancers [35]. In this study, AFP differed from the other three tumor markers in that its value did not differ significantly between the case group and the control group, and its AUC value for predicting CRC alone was only 0.553, indicating poor predictive performance; it is therefore suitable only for use in combination with other tumor markers in the diagnosis or monitoring of CRC.

With the development of AI technology, more and more researchers are integrating it into the practice of disease diagnosis and treatment. Our study constructed six machine learning models using demographic and laboratory indicators that are common in the clinic, compared their diagnostic efficacy by AUC value, accuracy, sensitivity, specificity, and other metrics, and finally selected the best-performing model, the gradient boosting machine. We also analyzed the variable importance of each of the six machine learning models. In the best-performing GBM model, the top five variables by importance were age, ALB, CEA, HCT, and Hb. In our study, age was the most important risk factor for colorectal cancer, and the incidence increased with age. This agrees with the conclusion of the USPSTF [36], which, because most new cases occur in people older than 45 years, set 45 years as the age at which colorectal cancer screening is recommended. ALB is synthesized mainly by the liver, is an important plasma protein, and is often used clinically as a measure of patients’ nutritional and health status [37]. In a study by Heys et al. [38], pre-treatment serum albumin concentration served as an independent prognostic indicator for colorectal cancer, suggesting that albumin can be considered among the candidate screening and prognostic markers for colorectal cancer. Notably, ALB was of high importance in our study, second only to age. This may reflect the effect of tumor burden on the body's general condition, and attention to changes in ALB during cancer screening is important for the diagnosis of colorectal cancer. Colorectal cancer is often accompanied by anemia [39]; HCT and Hb are both diagnostic markers of anemia, and a decrease in either suggests a risk of anemia [40, 41]. Ben et al. [42] combined HCT and Hb, respectively, with several other laboratory indicators to construct prediction models for sporadic colorectal cancer; after linear correction, the AUC value of the HCT model reached 0.76 and that of the Hb model reached 0.80, indicating that both had good predictive value for colorectal cancer.

Our study has several limitations. First, the models were externally validated only with a population attending the same hospital in a different time period, without validation in other hospitals or regions, which may limit the generalizability of the model to some extent. Second, the variables in this study were mainly laboratory indicators and did not include information on lifestyle, dietary habits, or past medical history; in the future, we hope to collect such variables and incorporate them into the model to improve its predictive ability and broaden its scope of application.

Conclusion

In summary, compared with the traditional serological tumor markers, the machine learning models performed better in colorectal cancer diagnosis, with the gradient boosting machine model as the best choice. Applied to large-scale population screening, this model may distinguish colorectal cancer patients from healthy people more accurately, give doctors a more reliable basis for diagnosis, and support the rational allocation of medical resources. In addition, ALB, HCT, and Hb, which are very common and economical tests in clinical practice, showed predictive efficacy similar to that of tumor markers in the prediction of CRC and may be of great value for the prediction and prognosis of CRC in the future.