Keywords

1 Introduction

With the development of medicine and the application of Smart healthcare, many diseases are under control. But the number of trauma patients is increasing, and the number of deaths is also increasing. Trauma has become the first cause of death for patients between 1 and 44 years old [1]. Trauma-related infection is one of the most common complications of trauma patients. Due to the difference in traumatic environment and degree of trauma, most trauma patients will have different degrees of infection, among which sepsis is the result of out-of-control infection of trauma patients in the late stage. Difficulty in predicting, diagnosis, and treatment are the reasons for the extremely high mortality of sepsis. It can highly improve the treatment rate of sepsis if sepsis prediction in the early stage. Many researchers applied Smart healthcare in the study of sepsis, combined with the computer, big data, and clinical medicine, analyze the risk factors for sepsis, trying to find out the relationship between the specific indicators and sepsis, focus on the point in a certain time of sepsis judgment, lack of the forecast on time interval judgment [2]. This paper will focus on the possibility of sepsis in trauma patients over time. Combined with the MIMIC-III database, this paper will conduct data mining and statistical analysis, and machine-learning algorithms to construct time window prediction and risk scoring tools for trauma-related sepsis to reduce the risk of sepsis and improve the working efficiency of medical staff.

2 Methods

2.1 Study Population and Data Sources

All data used in this study obtained from the Medical Information Mart for Intensive Care III database (MIMIC-III). The data are patients aged 18 years or older who had been in the ICU for 4 h or more due to trauma. Blood culture should be performed within 24 h if the patient uses antibiotics first. If the patient is performed within blood culture first, antibiotics should use within 72 h, and the priority project time recorded as Tsuspicion. When the SOFA score of the patients at 12 h after Tsuspicion minus the SOFA score at 24 h before Tsuspicion is higher than 2, the patients are identified as having sepsis, that is, the experimental group, otherwise for the control group.

2.2 Data Processing

The SOFA score is used as the inclusion and exclusion criteria of the research experiment. Data is preprocessed by data transpose, outlier processing, missing value analysis, and data filling. Missing no module of Python is used as the main tool for missing value analysis, and indicators with a missing ratio of more than 80% are removed. And then time series data are filled based on two dimensions, namely linear interpolation and distance filling.

2.3 Feature Selection and Machine Learning

Feature Selection is the preliminary step of machine learning and data mining, and it is a process of data preprocessing. It eliminates redundant or irrelevant features to identify the most important features, thus reducing the complexity of the problem [3]. In this study, the greedy algorithm is used to design a feature selection algorithm, and the XGBoost algorithm is used to select features of 35 indicators by sequential forward search strategy search. XGBoost has good anti-over-fitting characteristics and high computational efficiency [4]. The tree model of XGBoost is characterized by providing a basis for quantitative feature selection and forming encapsulated feature selection. The time-series data of trauma-related sepsis are input into the key indicator screening model for iteration, and the results of each iteration are recorded to select the index with the highest performance.

Compared with the black-box model (uninterpreted algorithm taking neural network as an example), the Decision Tree is based on if-then-else rules and is easier to understand, realize, explain and visualize [5]. The neural network (the black-box model representation) has certain defects: difficult to optimize, result in the local-optimal solution rather than the global-optimal solution, and low generalization leads to overfitting problems, etc. To sum up, this study uses the decision tree algorithm to build the time window prediction model. This study also uses the random forest and Boosting method which is derived from the decision tree to carry out multiple groups of experiments, to improve the accuracy of time window prediction of trauma-related sepsis [6].

In this study, grid parameter iteration is used for parameter adjustment. Given a set of parameters, the enumeration search method is used to iterate over all possibilities to select the best result (Fig. 1).

Fig. 1.
figure 1

Time prediction model parameter adjustment and grid search setting.

The logistic regression model is a multivariate statistical method to study the relationship between explanatory variables and observed results. In this study, the scoring tool is based on the key index set, using the multi-factor logistic regression method and the clinical grading consensus of the indicators to quantitatively calculated the severity of the patient's illness, namely the score. And then the score is corresponding to the outcome probability. The specific steps of model construction are as follows:

  1. 1)

    Calculate the risk index regression coefficient β of each index;

  2. 2)

    Combined with the inherent medical knowledge to determine the scoring threshold of each index, and determine the reference value \({w}_{ij}\) of each group;

  3. 3)

    The basic risk reference value \({w}_{iREF}\) of each risk indicator is determined. In the subsequent scoring model construction, \({w}_{iREF}\) is recorded as 0 points, and when it is higher than \({w}_{iREF}\), it is recorded as positive points; otherwise, the higher the score, the higher the risk is.

  4. 4)

    Calculate the distance D between the reference value \({w}_{ij}\) of each risk indicator and basic risk reference value \({w}_{iREF}\):

    $$D=\left({w}_{ij}-{w}_{iREF}\right)*{\beta }_{i}$$
    (1)
  5. 5)

    Set constant B, the change value of the index corresponding to the change of 1 point in the risk scoring tool;

  6. 6)

    Calculate the score \({Points}_{ij}\), of each group of risk indicators, and round the calculated value as the corresponding score value of this group:

    $${Points}_{ij}=\frac{D}{B}=\left({w}_{ij}-{w}_{iREF}\right)*{\beta }_{i}/B$$
    (2)
  7. 7)

    Calculate the total score and risk prediction probability:

$$\widehat{p}=\frac{1}{1+\mathrm{exp}({\beta }_{i}{X}_{i})}$$
(3)

3 Results

This study uses PostgreSQL to extract data from the MIMIC-III database. And finally obtained data of 177 patients in the experimental group and 369 patients in the control group, with 35 various examination and laboratory indicators, amount to 201189 records.

The Hosmer Lemeshow goodness of fit index (H-L) [7] is used to verify the time series data after filling, and the significance of the result is 0.553 which greater than 0.05. There is no significant difference between the predicted value and the observed value, which proves that the model has a good fit.

3.1 Key Risk Indicators

The key indexes of trauma-related sepsis are extracted by feature selection. After 20 separate repeated experiments, the key indexes with retention times more than or equal to 16 times are Urea Nitrogen, PTT, PO2, Sodium, Red Blood Cells, CO2, and INR. The key indicators for retention between 12 and 16 times are Lactate and White Blood Cells. Hematocrit, Chloride, Hemoglobin, Temperature, Base Excess are the key indicators of retention between 10 and 12 times. Key indicators with retention times between 8 and 10 are Heart Rate, PCO2, Glucose, Platelet, Creatinine, and Calcium. The more times a single indicator is retained, the greater the influence of this indicator on the outcome and the stronger the ability to identify patients.

In this study, the feature_importances function of XGBoost is used to obtain the characteristic importance of each index. The results are shown in the following Table 1.

Table 1. Indicator weights and rankings

Compared with the calculation results, find that Urea Nitrogen and PTT, with the most retention times of key indicators, rank the first and the sixth in the ranking of characteristic importance, respectively, which proved that there is a causal relationship between key indicators and the accuracy of the predicted results of trauma-related sepsis. The weight of characteristic importance does not represent the correlation degree between the indicators and the predicted results, but it proves the correlation between the indicators and the predicted results, which lays a foundation for the subsequent prediction of trauma-related sepsis (Table 2).

Table 2. Forecast the distribution of key indicators

By summing up the weights of all the key indicators, the calculation results show that the weight summation of Blood Biochemical and Blood Routine is the highest, and the distribution is 0.308 and 0.21. To a certain extent, it proved the important value of Blood Biochemical and Blood Routine in the prediction of trauma-related sepsis, followed by Arterial Blood Gas and Coagulation Function, and Vital Signs had the least influence in the prediction of trauma-related sepsis (Fig 2).

3.2 Time Window Forecast

Under the time window model of the full index data set, the accuracy rate, recall rate, and precision rate of the four model algorithms are between 64% and 83%, which meet the requirements of clinical medicine. The best prediction effect is the Random Forest. From the perspective of time, although the prediction effect fluctuates slightly, the overall accuracy decreases with the increase of time, which is in line with the actual prediction logic. Moreover, the overall model performance increase with time, but the changing trend is not obvious, which proved the stability of the model and is more conducive to the earlier prediction and early warning of trauma-related sepsis (Table 3, Fig. 3).

Table 3. Comparison of prediction time parameters of all indexes in different Time
Fig. 2.
figure 2

ROC curve and AUC of internal validation of each model in the full index data set.

According to the model AUC value, all the models are higher than 0.64 under different time window parameter Settings, which can meet the dynamic requirements (Table 4).

Fig. 3.
figure 3

AUC comparison of internal validation among models in the full indicator set.

Table 4. Comparison of prediction time parameters of key index datasets in different sets
Fig. 4.
figure 4

ROC curve and AUC of internal validation of each model in different sets.

Under the time window model of “Key Indicator Set 1–4”, the performance results of each machine learning method are still above 63%, and the accuracy, recall, and precision rate of Random Forest are all the best. Although the effect of different indicators fluctuated slightly, in general, the prediction effect decreased with the decrease of the number of key indicators (Fig. 4).

According to the model AUC value, all models are higher than 0.63 under different time window parameter Settings, among which Random Forest performed the best, followed by XGBoost, Adaboost, and Decision Tree. With the decrease of the amount of key index data, the prediction effect has a certain tendency to decrease, but it is not obvious, which does not affect the application requirements of the dynamic real-time time window prediction model for trauma-related sepsis, which is in line with clinical practice, and proves the generalization ability of the time window prediction model for trauma-related sepsis (Fig. 5).

Fig. 5.
figure 5

AUC comparison of internal validation among models in the key indicator set.

3.3 Risk Score

In this study, the two indexes with the highest weighted sum of feature importance, blood biochemical and blood routine, are selected to construct a risk scoring tool. Creatinine is taken as a constant reference index, and the results are as follows (Table 5).

Table 5. Trauma-related sepsis risk scoring tool

Taking a wounded patient as an example, the sum of all index scores is 8, so the probability of the patient suffering from sepsis at that moment is 91.58% (Table 6).

Table 6. Comparison table of the total score and risk prediction probability of trauma-related sepsis

Input the time series data of trauma patients into the risk scoring model can get the change of the risk probability of the trauma patients suffering from sepsis, which reflects the increase in the severity of trauma-related infection over time. Taking a certain trauma patient as an example, the data from 4 h before to 24 h after the treatment of antibiotics showed that the patient's condition improved after the treatment of antibiotics, but only lasted for a while. The subsequent changes in the patient's condition are not detected in time, leading to a rapid increase in the possibility of the patient suffering from sepsis (Table 7, Fig. 6).

Table 7. Changes of risk scores and risk probability in a period of A trauma patient
Fig. 6.
figure 6

Changes of risk scores in a period of one trauma patient.

The AUC value of the trauma-related sepsis risk scoring tool is 0.79, which proved that the model has a certain generalization ability and can meet the clinical needs.

4 Conclusion

Sepsis is one of the main causes of death in patients with trauma-related infection in clinical practice. The onset of sepsis is serious and develops rapidly. Once diagnosed, a large amount of time for diagnosis and treatment has been lost, and it is difficult to control the development of sepsis. At the same time, due to the particularity of sepsis, the diagnosis process is complicated and needs a long time, which leads to the delay in the judgment of trauma-related sepsis by medical staff. Therefore, this study realizes the trauma-related sepsis predict warning, as well as the risk score. The model has good generalization ability, and can basically meet the clinical practical application, could help doctors perceive ahead of trauma-related sepsis in patients development trend and extent, thus early medical intervention on the patients, Controlling the development of sepsis will greatly improve the treatment rate of sepsis.

The next step is to validate the model in clinical trials. The diagnosis of sepsis needs to base on the changes of various indicators of patients over some time, so the data obtained in this study is limited, and the conditions of sepsis patients are diverse. As the limited data cannot accurately display all the clinical manifestations of sepsis, the model obtained in this study cannot achieve higher performance. More effective and real clinical data can optimize and improve the model.