Introduction

Chronic kidney disease (CKD), a major public health problem whose incidence and prevalence increase year by year, affects 700 million people globally [1]. A recent study projecting the future burden of CKD in the United States estimated that the prevalence of CKD (here defined as CKD stages 1–4) among those aged 30 years or older would increase from 13.2% in 2010 to 16.7% in 2030 [2]. In 2012, a survey including 47,204 adults from 13 provinces and cities in China showed that the total prevalence of CKD was 10.8% [3]. It is estimated that the number of dialysis patients in China will increase at a rate of 20–30% per year; that is, over 400,000 Chinese patients will develop end-stage renal disease (ESRD) every year, which comprises a large part of the world’s ESRD population. This development will place a heavy burden on public health and society and poses a considerable challenge to health care services. The cost of dialysis treatment alone for one patient is approximately $14,300 per year, whereas the per capita disposable income in China is $1210 in urban areas and $375 in rural areas [4]. Early recognition and prevention of potential ESRD is therefore of significant importance.

In 2012, the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines recommended reclassifying CKD [5]. The classification divided CKD stage 3 into two subgroups by applying a cutoff point of the estimated glomerular filtration rate (eGFR) of 45 mL/min/1.73 m2. Hence, subjects with CKD stage 3a were considered low risk compared with patients with CKD stage 3b. This new classification was based on a meta-analysis performed in 45 cohorts involving over 1.5 million participants, mainly from developed countries [5]. The incidence, prevalence, and progression of CKD vary within countries by ethnicity and social determinants of health, possibly through epigenetic influence [6]. The prevalence of CKD and the prevalence of diabetic CKD have both stabilized in the United States since the early 2000s, signaling a change in the epidemiology of CKD [7]. Several populations around the world face an emerging risk of increasing CKD, including China (in the context of rapid urbanization and a rising incidence of diabetes) [7]. Therefore, we cannot simply follow guidelines derived from data from developed countries. In addition, only a few studies have addressed whether KDIGO staging is applicable to patients with CKD stage 3 in China [8], and these did not observe differences in prognosis between patients with stage 3a and 3b CKD. It is important to know whether the division of CKD stage 3 is suitable for Chinese patients.

Electronic medical records provide large-scale real-world clinical data for use in developing clinical decision systems. However, sophisticated methodology and analytical skills are required to handle the large-scale datasets necessary for the optimization of prediction accuracy [9]. With government incentives offered to clinical organizations to transition from paper-based patient information to well-structured and managed digital form, there has been a tremendous explosion in the availability of patient-centric healthcare data. Such data can be leveraged to open new avenues in advancing healthcare by improving patient care and creating new efficiencies in delivering care [10]. In addition, early prediction of deterioration can play an important role in supporting health care professionals, as an estimated 11 percent of hospital deaths follow a failure to promptly recognize and treat deteriorating patients [11]. Machine learning algorithms are well suited to analyzing large, complex datasets [12]; they can identify information quickly and effectively and explore intrinsic relationships. To verify the staging for Chinese patients with CKD stage 3 and to build an alerting model, we developed our CKD stage 3 modeling (CSM) approach and evaluated its reliability in a retrospective study involving CKD patients treated at Xiangya Hospital, one of the largest hospitals in South Central China. The CSM approach computes the cutoff point of the eGFR for staging of patients with CKD stage 3 and possible risk factors for progression to ESRD based on the following three components: (1) identifying the cutoff points according to the data distribution; (2) verifying the demarcation point using eGFR values of 40–48 as candidate dividing points for stage 3a/3b CKD; and (3) assessing the risk factors of stage 3a/3b CKD by using random forest (RF) analysis.
This study is based on the Central South University medical big data project subject platform [13]. Using Spearman correlation coefficient (SCC) analysis and algorithms including logistic regression (LR), random forest (RF), support vector machines (SVMs), and neural networks (Nnets), we explored whether the KDIGO stage criteria are suitable for patients in South Central China with stage 3a/3b CKD. Moreover, we explored factors that influenced the prognosis of patients with stage 3a/3b CKD under the new criteria.

Materials and methods

Study design

We conducted a retrospective cohort study using the full text of clinical notes in the year when the patients first met the criteria for CKD stage 3 (30 ≤ eGFR < 60 mL/min/1.73 m2). All the clinical data were extracted from the electronic medical records system (EMRS). The data were analyzed using the CSM system. All identified events were adjudicated through chart review.

CKD stage 3 modeling (CSM)

The CSM approach is a prediction model based on an artificial intelligence core intended to identify a new cutoff point of 43 in patients with CKD stage 3 and to distinguish the different factors related to progression of stage 3a/3b CKD to ESRD (Fig. 1). The predictive model at the CSM core is a machine learning model that was trained on a set of 50,049 clinical records extracted from the clinical information system. The artificial neural network was trained to predict a suitable cutoff point for CKD stage 3 among patients in South Central China and to identify influential factors for progression of CKD stage 3a/3b to CKD stage 5 based on the parameters listed in Table 1.

Fig. 1 Flow diagram of the CKD stage model

Table 1 Patient characteristics in the two categories of patients in the facility level analysis

Population studied

Inpatients and outpatients with an eGFR between 30 and 60 mL/min/1.73 m2 who were treated at Xiangya Hospital between August 1, 2010, and April 1, 2018 were included. One of the enrollment criteria was that patients were followed up at least once a year. If a patient had multiple records in a year, each record was obtained. The time of onset of CKD stage 3 was recorded as the first time CKD stage 3 was diagnosed. The time when the eGFR decreased to less than 15 mL/min/1.73 m2 was also recorded. The eGFR was determined with the CKD Epidemiology Collaboration (CKD-EPI) equation for Chinese patients with CKD [8]. We screened patients with CKD stage 3 and CKD stage 5, recorded the eGFR and the date they were first included in each cohort, extracted the intersection of both cohorts, and took the date of first diagnosis of CKD stage 3 before progression to CKD stage 5 as the onset date. Exclusion criteria were as follows: acute kidney injury (AKI) (2012 KDIGO guidelines); age < 18 or > 70 years; a first diagnosis of CKD stage 3 that was later than the diagnosis of CKD stage 5; incomplete clinical data; and hemodialysis, peritoneal dialysis, or kidney transplantation.
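The cohort logic described above (first stage 3 record, first record with eGFR < 15, and exclusion when stage 5 precedes stage 3) can be sketched as follows. This is a minimal illustration in Python, not the code used in the study; the function names and the record layout of (visit date, eGFR) pairs are assumptions made for the example.

```python
from datetime import date

def ckd_stage(egfr):
    """Coarse label from eGFR (mL/min/1.73 m2), using only the ranges quoted in the text."""
    if egfr < 15:
        return "stage 5"
    if 30 <= egfr < 60:
        return "stage 3"
    return "other"

def first_dates(records):
    """records: iterable of (visit_date, egfr) pairs for one patient (hypothetical layout).
    Returns the first stage 3 date and the first stage 5 date (None if never observed)."""
    first = {}
    for visit_date, egfr in sorted(records):
        first.setdefault(ckd_stage(egfr), visit_date)
    return first.get("stage 3"), first.get("stage 5")

def progression_days(records):
    """Days from the first stage 3 record to the first stage 5 record.
    Returns None when there is no stage 3 record, no progression, or when
    stage 5 was not recorded after stage 3 (an exclusion criterion in the text)."""
    onset, esrd = first_dates(records)
    if onset is None or esrd is None or esrd <= onset:
        return None
    return (esrd - onset).days

example = [(date(2012, 1, 1), 50.0), (date(2013, 1, 1), 35.0), (date(2014, 1, 1), 12.0)]
print(progression_days(example))
```

Patients who never reach stage 5 return None here and would be handled as non-progressors in a fuller pipeline.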

Study outcomes

ESRD was defined as an irreversible decline of the eGFR to < 15 mL/min/1.73 m2. The final ascertainment of the eGFR was based on values from a central laboratory. ESRD events were adjudicated by an independent committee consisting of relevant specialist physicians.

Data collection

We obtained data from the EMRS of Xiangya Hospital. We collected information on patient demographics (name, ID, age, sex), diagnosis, accompanying diseases (diabetes, hypertension, and cardiovascular disease), and the following laboratory data: urine nitrite (NIT), urobilinogen (URO), bilirubinuria (BiL), urine specific gravity (SG), urine white blood cells (WBC), urine vitamin C (Vitamin C), glucosuria (Glu), proteinuria, ketonuria (Ket), urine pH (PH), neutrophil percentage (NeuTP), neutrophil count (NeuT), monocyte percentage (MONOP), monocyte count (MONO), basophil percentage (BASOP), basophil count (BASON), eosinophil percentage (EOP), eosinophil count (EON), mean corpuscular volume (MCV), mean platelet volume (MPV), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), white blood cell count (WBC), red blood cell volume distribution width (RDW), hematocrit (HCT), red blood cell count (RBC), lymphocyte percentage (LYMPHP), lymphocyte count (LYMPHN), platelet volume distribution width (PDW), thrombocytocrit (PCT), platelet (PLT), hemoglobin (HGB), total bile acids (TBA), total bilirubin (TBIL), total protein (TP), albumin (ALB), globulin (GLB), albumin-to-globulin ratio (A/G), direct bilirubin (DBIL), blood high-density lipoprotein cholesterol-to-total cholesterol (HDL/TC), low-density lipoprotein (LDL), alanine aminotransferase (ALT), aspartate aminotransferase (AST), cholesterol (TC), chloride (CL), triglycerides (TG), high-density lipoprotein (HDL), serum creatinine (sCr), urea (UREA), uric acid (URIC), glucose (Glu), calcium (Ca), sodium (Na), potassium (K), and eGFR. We defined baseline laboratory values for each laboratory test as the first available result on or after the first diagnosis of CKD stage 3. Hypertension was defined as a systolic blood pressure (BP) ≥ 140 mmHg and/or a diastolic BP ≥ 90 mmHg, or a diagnosis of hypertension.
Patients were considered to have diabetes mellitus if they had a fasting glucose ≥ 7.0 mmol/L, an HbA1c ≥ 6.5%, or a diagnosis of diabetes. If ESRD did not occur by the average time of progression among patients with CKD stage 3, the observation was censored. This study was approved by the Ethics Committee of Xiangya Hospital, and the need for informed consent was waived. We adhered to the Declaration of Helsinki.
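The comorbidity definitions above translate directly into simple rules. The sketch below is illustrative only; the function and parameter names are hypothetical, and missing measurements are represented by None.

```python
def has_hypertension(systolic, diastolic, diagnosed=False):
    """Systolic BP >= 140 mmHg and/or diastolic BP >= 90 mmHg, or a recorded diagnosis."""
    return diagnosed or systolic >= 140 or diastolic >= 90

def has_diabetes(fasting_glucose_mmol_l=None, hba1c_percent=None, diagnosed=False):
    """Fasting glucose >= 7.0 mmol/L, HbA1c >= 6.5%, or a recorded diagnosis.
    A measurement left as None is treated as not available."""
    if diagnosed:
        return True
    if fasting_glucose_mmol_l is not None and fasting_glucose_mmol_l >= 7.0:
        return True
    return hba1c_percent is not None and hba1c_percent >= 6.5

print(has_hypertension(150, 80), has_diabetes(fasting_glucose_mmol_l=5.5, hba1c_percent=5.9))
```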

Data processing

First, correlation analysis between the different candidate CKD stage 3a/3b cutoff points and the time of progression in the study period was carried out. Second, the function cor.test() was used to calculate the Pearson correlation coefficient (PCC), the SCC, and their related p values. LR was also used to calculate related p values. Later, four models, covering linear and nonlinear approaches (LR, RF, SVM, and Nnet), were built for each of the CKD stage 3a and CKD stage 3b groups. Logistic regression is a generalized linear model (GLM) procedure that uses the same basic formula as linear regression, but instead of a continuous Y, it models the probability of a categorical outcome. RF models were fitted using the function randomForest() in the package “randomForest”. SVM models were fitted using the function svm() in the package “e1071”. Nnet models were fitted using the function nnet() in the package “nnet”. Fivefold cross-validation was used to evaluate the statistical models (LR, RF, SVM, and Nnet). In fivefold cross-validation, the sample data are randomly partitioned into five equal groups. Each time, one group of data was retained as the validation data for testing the model, and the remaining four groups were used as training data. This process was then repeated five times, with each group used exactly once as the validation data. To reduce variability, five rounds of cross-validation were performed using different partitions, and the validation results were averaged over the rounds. The performance of each model was evaluated by comparing the predicted and observed numbers of patients who progressed to CKD stage 5. Third, CSM searched for risk factors of progression to ESRD in CKD stage 3a/3b patients by RF.
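The repeated fivefold cross-validation described above can be sketched as follows. The study used R; this Python version is only an illustration of the procedure. The helper names are hypothetical, and the threshold classifier is a toy stand-in for the LR, RF, SVM, and Nnet models, included solely so the sketch runs end to end.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Randomly partition range(n) into k near-equal validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(xs, ys, fit, predict, k=5, rounds=5, seed=0):
    """Average validation accuracy over `rounds` different k-fold partitions."""
    scores = []
    for r in range(rounds):
        for val in kfold_indices(len(xs), k, seed + r):
            held_out = set(val)
            train = [i for i in range(len(xs)) if i not in held_out]
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            hits = sum(predict(model, xs[i]) == ys[i] for i in val)
            scores.append(hits / len(val))
    return sum(scores) / len(scores)

# Toy stand-in model (an assumption for illustration): predict progression (1)
# when the value lies below the midpoint between the two class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x < threshold else 0

egfr = list(range(31, 41)) + list(range(50, 60))   # synthetic values, not study data
progressed = [1] * 10 + [0] * 10
print(cross_val_accuracy(egfr, progressed, fit_threshold, predict_threshold))
```

Each round uses a different seed so that the five partitions differ, which is what allows averaging across rounds to reduce the variance of the estimate.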

Results

After applying the inclusion and exclusion criteria, we identified 1090 patients who constituted the analytic cohort (Table 1). Among them, 455 were confirmed to have developed ESRD during follow-up (positive group). The median follow-up time was 4.0 years [95% confidence interval (CI), 4.295–4.489]. This work focuses on the use of machine learning to predict disease risk and model the contributing factors learned from an electronic health record dataset.

43 mL/min/1.73 m2 may be the new cutoff point for predicting CKD stage 3 progression to ESRD among patients in South Central China

Hypothesis tests between the patients who progressed to CKD stage 5 and those who did not were carried out for each feature. We analyzed the relationship between the eGFR and progression time with a scatter density map and contour lines and found that two high-density regions could be distinguished when the eGFR was 43 mL/min/1.73 m2 and when the eGFR was 45 mL/min/1.73 m2 (Fig. 2).

Fig. 2 Scatter plot for finding the eGFR cutoff point. The horizontal axis shows the time of progression (days), and the vertical axis shows the patient's first eGFR value. The joint distribution of the eGFR value and the time of progression is shown by the density curves. High-density regions lie close to eGFR values of 45 and 43, and a value of 43 better separates the density regions

Furthermore, all patient samples were split into CKD stage 3a and CKD stage 3b groups at different eGFR values (ranging from 40 to 48 mL/min/1.73 m2). Pearson (PCC) and Spearman (SCC) correlation coefficients were used to find the eGFR value that best distinguished CKD stage 3a from 3b with respect to progression to CKD stage 5. The correlation coefficient was largest when the eGFR was 43 mL/min/1.73 m2 (Table 2). Further, we used a logistic regression model to measure the association between the eGFR-based division of CKD3 patients into two groups and the time of progression to CKD stage 5. We found that when CKD3 patients were classified by an eGFR of 43 mL/min/1.73 m2, the regression coefficient and its significance were most prominent (Fig. 3, Table 3).
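The cutoff scan can be illustrated with a small sketch. It assumes, as one plausible reading of the text, that the correlation is computed between a binary 3a/3b indicator and the progression time; the function names are hypothetical and the data are synthetic. The Spearman coefficient is computed as the Pearson coefficient of tie-averaged ranks.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def scan_cutoffs(egfr, days, cutoffs=range(40, 49)):
    """Correlate the 3a/3b indicator (1 if eGFR >= cutoff) with progression time."""
    result = {}
    for c in cutoffs:
        group = [1 if e >= c else 0 for e in egfr]
        if 0 < sum(group) < len(group):  # skip cutoffs that leave a group empty
            result[c] = spearman(group, days)
    return result

egfr = [31, 35, 39, 42, 44, 47, 50, 55]           # synthetic values, not study data
days = [100, 200, 300, 400, 500, 600, 700, 800]
scores = scan_cutoffs(egfr, days)
print(max(scores, key=scores.get))
```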

Table 2 Correlation between the eGFR and progression time by Pearson and Spearman correlation analysis
Fig. 3 Correlation between the eGFR and progression time with logistic regression measures. The figure shows the regression results. The estimated value of each coefficient is a point, the bold line represents the interval of one standard error, and the thin line represents the interval of two standard errors. The vertical line is zero. To evaluate statistical significance, we checked whether the two-standard-error interval contained 0; if it did not, the result was statistically significant

Table 3 Correlation between the eGFR and progression time by logistic regression measures

Using eGFR values from 40 to 48 mL/min/1.73 m2 as the dividing point between stage 3a and stage 3b CKD, we built classification models with four types of algorithms to distinguish stage 3a and 3b CKD; the model performance comparison is provided in the appendix. Based on the eGFR cutoff point of 43 mL/min/1.73 m2, the random forest model performed best for distinguishing stage 3a and 3b CKD patients who would progress to ESRD. For stage 3a and 3b CKD, this model, which included all the variables in Table 1, had an accuracy of 85% and 77%, respectively, and an area under the curve (AUC) value of 88% and 83%, respectively (Fig. 4a, b).
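For readers less familiar with the two metrics reported above, the following sketch shows one standard way to compute them from predicted probabilities. It is not the study's code; the AUC is computed through its rank interpretation (the probability that a randomly chosen progressor receives a higher score than a randomly chosen non-progressor, counting ties as one half), and the 0.5 decision threshold for accuracy is an assumption.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.5):
    """Share of cases where the thresholded score matches the observed label."""
    correct = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)

scores = [0.9, 0.4, 0.6, 0.3]   # synthetic predicted probabilities
labels = [1, 1, 0, 0]           # 1 = progressed to CKD stage 5
print(auc(scores, labels), accuracy(scores, labels))
```

Because the AUC does not depend on a decision threshold, it can rank models differently from accuracy, which is why both are commonly reported.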

Fig. 4 Comparison of unified models built with different algorithms. a CKD stage 3a patients. Random forest, support vector machine, logistic regression, and neural network algorithms were used to construct a classification model to predict whether patients with 43 ≤ eGFR < 60 mL/min/1.73 m2 would progress to CKD stage 5. The random forest model had the largest AUC value (0.8783), indicating that it had the best prediction effect. b CKD stage 3b patients. The same four algorithms were used to construct a classification model to predict whether patients with 30 ≤ eGFR < 43 mL/min/1.73 m2 would progress to CKD stage 5. The random forest model had the largest AUC value (0.8292), indicating that it had the best prediction effect. In both panels, the estimated value of each coefficient is a point, the bold line represents the interval of one standard error, and the thin line represents the interval of two standard errors

Screening predictors of CKD stage 3a/3b progression to CKD stage 5 by RF

After establishing a reliable forecast model, we used an RF to clarify the different risk factors for progression of stage 3a/3b CKD to CKD stage 5. The risk factors for stage 3a/3b CKD were explored by modeling with an RF at an eGFR cutoff point of 43 mL/min/1.73 m2, and the results of the importance assessment and analysis of the model parameters are given in Fig. 5. The higher the mean decrease in accuracy or the mean decrease in Gini score, the more important the variable was in the model.
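The mean decrease in Gini score can be made concrete with a single split. In a random forest, this importance measure accumulates, for each variable, the impurity decreases of all splits made on that variable and averages them over the trees. The sketch below computes the decrease for one binary split only; the function names and data are illustrative assumptions.

```python
def gini(labels):
    """Gini impurity of a set of binary labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

albumin = [25, 28, 38, 42]   # synthetic values, not study data
progressed = [1, 1, 0, 0]
print(gini_decrease(albumin, progressed, threshold=33))
```

A variable whose splits separate progressors from non-progressors cleanly yields large decreases, whereas a split that leaves both children as mixed as the parent contributes nothing.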

Fig. 5 a Important variables in the random forest model of patients with CKD stage 3a who progressed to CKD stage 5. b Important variables in the random forest model of patients with stage 3b CKD who progressed to CKD stage 5. Serum albumin, proteinuria, serum TP, serum TBIL, serum DBIL, serum A/G, blood HGB, serum Ca, eGFR, blood HCT, serum TC, serum ALT, serum HDL, urine SG, sCr, serum urea, and age accounted for the progression from stage 3a CKD to CKD stage 5. sCr, eGFR, serum TP, serum TC, serum urea, EOP, serum ALB, blood MCH, serum TBIL, diabetes, blood EON, serum Na, serum HDL, serum Cl, proteinuria, and age accounted for the progression from stage 3b CKD to CKD stage 5

The common influencing factors of stage 3a/3b CKD progression to CKD stage 5 included sCr, eGFR, serum TP, serum TC, serum urea, serum albumin, serum TBIL, serum HDL, proteinuria, and age. Furthermore, serum DBIL, serum A/G, blood HGB, serum Ca, blood HCT, serum ALT, and urine SG accounted for the progression of CKD stage 3a to CKD stage 5. The contributing factors for CKD stage 3b progression to CKD stage 5 included blood EOP, blood MCH, diabetic kidney disease, blood EON, serum Na, and serum Cl.

Incidence rates of ESRD events according to the eGFR cutoff of 43 mL/min/1.73 m2

The incidence rates of ESRD events according to the eGFR cutoff of 43 mL/min/1.73 m2 are shown in Table 4. During the median follow-up of 4.0 years (95% CI, 4.295–4.489), higher incidence rates of ESRD events were observed in CKD patients with a lower eGFR (Table 4, Fig. 6, p for log-rank test < 0.001).
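The survival comparison rests on the Kaplan–Meier estimator, which can be sketched in a few lines. This is an illustration with hypothetical names and synthetic data, not the analysis code of the study; in practice the estimator would be applied separately to the stage 3a and stage 3b groups defined by the cutoff of 43.

```python
from itertools import groupby

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.
    events[i] is 1 if ESRD occurred at times[i] and 0 if the observation was censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    for t, group in groupby(data, key=lambda pair: pair[0]):
        group = list(group)
        d = sum(e for _, e in group)        # ESRD events at time t
        if d:
            survival *= 1 - d / at_risk     # product-limit update
            curve.append((t, survival))
        at_risk -= len(group)               # events and censored cases leave the risk set
    return curve

follow_up_years = [1, 2, 3, 4]   # synthetic values, not study data
esrd = [1, 0, 1, 1]
print(kaplan_meier(follow_up_years, esrd))
```

Censored observations do not lower the curve themselves, but they shrink the risk set and so increase the size of later steps.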

Table 4 Relationship between the eGFR cutoff point of 43 mL/min/1.73 m2 for stage 3 CKD patients and ESRD event rates
Fig. 6 Kaplan–Meier curve for ESRD events according to the eGFR cutoff of 43 mL/min/1.73 m2 for stage 3 CKD patients. The survival curve of patients with stage 3a CKD declined more gradually than that of patients with stage 3b CKD, indicating that the prognosis of patients with stage 3a CKD was better than that of patients with stage 3b CKD

Discussion and conclusion

The KDIGO guidelines on CKD represent an extraordinary effort to summarize and synthesize evidence together with a thoughtful expression of the best practices and opinion [14]. One of the meaningful suggestions was the division of stage 3a and 3b CKD. It was suggested that it would be clinically sound to subdivide CKD stage 3 into stages 3a (45–59 mL/min/1.73 m2) and 3b (30–44 mL/min/1.73 m2), as these two ranges may be associated with different clinical patterns and risks. It has recently been shown that patients with CKD and an eGFR < 45 mL/min/1.73 m2, particularly older patients, experience faster disease progression [15]. Patients with CKD stage 3b should probably be referred earlier for specialized renal care [16]. Some have recommended that people with an eGFR category CKD stage 3a without associated markers of kidney damage (proteinuria or hematuria) should not necessarily be considered to have CKD and should be considered for further evaluation and referral according to the clinical judgment of the health care provider [17, 18].

CKD is a global health challenge, especially in low- and middle-income countries. China is a large developing country with different health care and primary care structures, and some recommendations by the international guideline groups might not be relevant to the Chinese population. First, the prevalence of CKD stage 3 was 1.6% in China compared with 7.7% in the USA and 4.2% in Norway [3]. Studies have described a rise in the prevalence of diabetes in China [19, 20], a signal strongly forewarning a growing epidemic of CKD in China in the upcoming years to decades, perhaps analogous to trends seen in the United States from the 1980s to the early 2000s [21]. Wen et al. reported the prevalence of CKD and its stages among the general population in Taiwan [22], where the ethnicity and living habits were the same as in Mainland China but the economic development was better, and they found a higher proportion of lower eGFR (CKD stage 3 or worse) than that reported among the population in Mainland China [3]. With respect to CKD in China, there were twice as many people with proteinuria as with a low eGFR, while in the US, the difference in prevalence between a low eGFR and proteinuria was much smaller than in China [23]. Taken together, given the differences in CKD prevalence, ethnicity, economic development, and genetic and environmental backgrounds between China and Western countries, we should evaluate the guidelines according to our actual situation rather than simply adhering to the recommendations.

As the adoption of electronic health records continues to rise and a generation of individuals has their entire health histories stored electronically, this approach provides a novel way to gain potential insights about disease risk as a natural byproduct of care delivery and electronic health record documentation [24]. Mathematical and statistical tools developed in the field of artificial intelligence (AI) and machine learning are well poised to assist clinical researchers in deciphering complex predictive patterns in healthcare data [25]. It is challenging for humans to analyze these massive data directly, not only because of the time required and the care needed to avoid human errors, but also because of the limited ability to derive in-depth insights or information. Clearly, machine learning holds unparalleled advantages over humans in these domains [26]. Unlike previous CKD stage 3 classification studies, this is the first study to use an unbiased machine learning approach using text from clinical notes to identify appropriate cutoff points for patients with CKD stage 3, determine different risk factors for CKD stage 3a and 3b, and, more importantly, build a model to predict the possibility of progression to ESRD in a predetermined period. The face validity of this approach was confirmed by different AI calculation methods. This study also proposes methods to extract insights about performance trends that cannot be easily extrapolated using standard analyses and treats various influencing factors according to the model set by the CSM approach.

In this computer-based retrospective analysis, we confirmed that it is clinically significant to divide CKD stage 3 patients into CKD stage 3a and 3b. More importantly, machine learning, when applied to predictive modeling, can determine patterns of risk factors useful for improving prediction quality [27]. In this study, machine learning identified several well-established risk factors for ESRD in CKD stage 3 patients, including age [1], proteinuria [1], diabetic kidney disease [1], eGFR [1], serum ALB [2], creatinine [28], blood urea [29], hematocrit [2], serum cholesterol [30], HDL cholesterol [31], HGB [27], TBIL [32], DBIL [33], serum ALT [34], serum Na [35], serum Cl [35], and serum calcium [36]. In addition, the machine learning method identified some risk factors that have not been previously described, such as A/G, MCH, urine SG, TP, EOP, and EON; future research is needed to determine the possible role of these factors in the progression of CKD.

In addition, different factors are associated with progression from stage 3a and from stage 3b CKD to CKD stage 5. Apart from the common factors of CKD stage 3 progression to CKD stage 5, serum DBIL, serum A/G, blood HGB, serum Ca, blood HCT, serum ALT, and urine SG accounted for the progression from CKD stage 3a to CKD stage 5. The contributing factors for CKD stage 3b progression to CKD stage 5 include blood EOP, blood MCH, diabetic kidney disease, blood EON, serum Na, and serum Cl. These findings may remind clinicians to pay attention to different factors in patients with stage 3a and 3b CKD.

This is the first study to use an unbiased approach using text from clinical notes to identify predictors of progression to ESRD among CKD stage 3 patients. Our work confirmed that it is reasonable to divide CKD stage 3 into stage 3a and 3b. In addition, different machine learning methods indicated that an eGFR of 43 mL/min/1.73 m2 is a suitable cutoff point for predicting progression to ESRD in patients in South Central China. More importantly, our findings may provide clinical proof of the beneficial effects of deploying the CSM approach as part of routine nephrological practice. As systems analytics, big data, and machine learning, among others, come online and become more widely available, we may be able to tackle CKD more holistically, efficiently, and satisfactorily.

This study has some limitations. The primary limitation is that its findings are drawn from a single tertiary hospital, which may have idiosyncrasies in documentation style and patient characteristics that differ from other institutions. Validation of this analysis in other cohorts is needed. Nevertheless, this approach was successful in translating the clinical narrative into a tool for the discovery of possible predictors that have not been previously linked to kidney failure. Second, if low-risk patients were systematically excluded from these cohorts due to a lack of follow-up creatinine testing, then estimates from the resulting models could overestimate the risk of advanced chronic kidney disease. Third, the kidney disease outcome evaluated in this paper was progression to ESRD; future prospective studies may also include death or cardiovascular events as additional outcomes. Fourth, because of the retrospective design, we could not reliably collect treatment information. We will conduct prospective research to collect more detailed data, to replicate our findings and approach in multicenter cohorts, and to determine the cutoff points of different stages of CKD in the future.

In summary, our findings confirm a new cutoff point for CKD stage 3 by computational intelligence, which differs from that of a previous study. The CSM approach provides a novel tool to identify the different influencing factors for stage 3a/3b CKD progression to CKD stage 5. The CSM approach may be adapted and used in the management of other chronic diseases in which international guidelines require confirmation in different populations.