Introduction

Heart failure (HF) is a global problem with an estimated prevalence of 64.3 million people worldwide, and it is the most common diagnosis in hospitalized patients aged 65 years or older [1, 2]. Despite some progress in treatment, the prognosis of HF is worse than that of most cancers [3]. Many patients with HF present with tissue congestion due to fluid retention, a condition also called congestive heart failure (CHF) [4]. At the pulmonary level, cardiac dysfunction can increase the pulmonary capillary wedge pressure (PCWP), causing interstitial and alveolar edema with subsequent dyspnea.

Mechanical ventilation (MV) is commonly used in patients with CHF to relieve respiratory failure and reduce the work of the heart [5]. Owing to the aging population and progress in treatment, the proportion of patients receiving MV and the duration of MV have been increasing in recent years [6]. MV is a life-saving treatment; however, it also carries life-threatening complications, including ventilator-associated pneumonia, hemodynamic effects, and sedative toxicity [7], and the duration of MV has a near-linear relationship with outcomes [8]. Prolonged MV (PMV) has been defined by experts as 21 days of MV [9]; however, many clinical studies have defined PMV as MV support lasting >4, >7, or >14 days because these thresholds offer greater clinical applicability [10,11,12]. It is uncontroversial that PMV is strongly associated with poor prognosis [13]. Nevertheless, studies predicting PMV in patients with CHF are lacking.

Machine learning (ML), which merges statistical analysis with computer science to produce algorithms, is a subset of artificial intelligence (AI) and can detect relationships between potential explanatory features and a known target outcome [14]. In a previous study, we developed and validated a CatBoost model that accurately predicted hospital mortality in mechanically ventilated patients with CHF [15]. In the present study, we aimed to establish a prediction model based on supervised ML for the early identification of patients with CHF who require PMV.

Methods

Sources of Data

Patients’ data were collected from the Medical Information Mart for Intensive Care IV (MIMIC-IV, version 1.0), a large single-center database containing information on 257,366 individuals who were admitted to various ICUs of the Beth Israel Deaconess Medical Center (BIDMC) between 2008 and 2019 [16]. The code for data extraction is available on GitHub (https://github.com/MIT-LCP/mimic-iv). The data of the external validation set were extracted from the eICU Collaborative Research Database (eICU, version 2.0), a multi-center ICU database with high-granularity data for over 200,000 admissions to ICUs of 208 hospitals located throughout the United States between 2014 and 2015 [17]. The Massachusetts Institute of Technology approved the establishment of the databases with an informed consent exemption. The study was reported according to the recommendations of the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [18].

Study Population and Definitions

In the present study, PMV was defined as a duration of MV ≥ 4 days. Specifically, PMV was categorized as prolonged acute MV (PAMV) (duration of MV ≥ 4 days), short-term PMV (duration of MV ≥ 7 days), and long-term PMV (duration of MV ≥ 14 days). The primary aim was to build a prediction model for PAMV, and the secondary aims were to establish models to predict short-term PMV and long-term PMV based on the MIMIC-IV cohort.
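
The three nested definitions translate directly into binary outcome labels. The following is a minimal sketch in Python; the function name `pmv_labels` and the input `mv_days` (MV duration in days) are illustrative assumptions and are not taken from the study code.

```python
def pmv_labels(mv_days: float) -> dict:
    """Return binary labels for the three nested PMV definitions.

    `mv_days` is the duration of mechanical ventilation in days
    (hypothetical input name; thresholds follow the definitions in the text).
    """
    return {
        "PAMV": mv_days >= 4,             # prolonged acute MV
        "short_term_PMV": mv_days >= 7,   # short-term PMV
        "long_term_PMV": mv_days >= 14,   # long-term PMV
    }


# A patient ventilated for 8.5 days meets the PAMV and short-term PMV
# definitions but not the long-term PMV definition.
example = pmv_labels(8.5)
```

Because the definitions are nested, every patient with long-term PMV is also labeled positive for short-term PMV and PAMV.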

Participants were selected according to the following inclusion criteria: (a) age > 18 years; (b) CHF as the major cause of admission; (c) MV received within the first 24 h of ICU admission. Patients with incomplete data were excluded.

Data Collection

PostgreSQL (version 13.0) was used to extract medical data from the two databases. The subject IDs were used to identify distinct patients. Demographics, vital signs, common comorbidities, and laboratory tests were included in the initial analysis. Because some patients had more than one ICU admission, the data of the first ICU admission and the first medical records during the ICU stay were selected for analysis. In the present study, CHF was diagnosed by clinicians based on the heart failure guideline [19]; pneumonia included community-acquired and hospital-acquired pneumonia, and the diagnostic criteria are described elsewhere [20]; acute kidney injury (AKI) was diagnosed based on the creatinine or urine output criteria of the clinical guideline [21]; sepsis was diagnosed based on the Sepsis-3 criteria [22]; and ventricular arrhythmia included ventricular tachycardia (sustained or non-sustained), ventricular flutter, and ventricular fibrillation. The definitions of the other included variables are shown in Supplementary Table S1.

Development of the Prediction Model

Key Features Identification

Identifying the key features in the training data increases the speed of model fitting and improves the feasibility of the model. In the present study, least absolute shrinkage and selection operator (LASSO) regression, which automatically removes unnecessary or uninfluential covariates through a penalization coefficient, was used to eliminate irrelevant variables [23]. Moreover, Shapley additive explanations (SHAP) values, which follow a game-theoretic approach, were used to assess the importance of each feature with the SHAP Python package version 0.40.0 [24].
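
To illustrate how the LASSO penalty drives weak coefficients to exactly zero, the sketch below implements LASSO for a linear least-squares objective by coordinate descent in plain Python. It is an illustration only: the function names and the toy data are assumptions, and the study itself applied LASSO to a binary outcome with its own software, which is not reproduced here.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # sum of feature j times the residual that ignores feature j
            norm = 0.0  # sum of squared values of feature j
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Toy data (made up): y = 2 * feature_0 + 0.1 * feature_1.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2.1, -1.9, 1.9, -2.1]

beta_no_penalty = lasso_coordinate_descent(X, y, lam=0.0)  # close to [2.0, 0.1]
beta_penalised = lasso_coordinate_descent(X, y, lam=0.5)   # weak feature dropped
```

With the penalty set to 0.5, the coefficient of the weakly related feature becomes exactly zero while the strong feature is retained with a shrunken coefficient, which is the behavior that makes LASSO usable for feature selection.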

ML Algorithm Selection

A total of 12 widely used ML algorithms, including AdaBoost, CatBoost, XGBoost, LightGBM, logistic regression (LR), bootstrap aggregating (Bagging), decision tree, random forest, support vector machine (SVM), naïve Bayes, multi-layer perceptron neural network (MLP), and K-nearest neighbors (KNN) models, were used to initially predict PMV based on the selected key features. The discrimination of these models was assessed by the area under the receiver operating characteristic curve (AUC). In addition, accuracy, positive predictive value (PPV), negative predictive value (NPV), balanced accuracy (BA), and F1-score were calculated for each model to evaluate the prediction performance. Moreover, the calibration curve and the Brier score were applied to qualitatively and quantitatively evaluate the calibration of the models, respectively. To demonstrate the decision benefit, decision curve analysis (DCA) was also conducted. The algorithm with the best discrimination and calibration was selected to perform the remaining analyses.
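
The evaluation metrics named above have simple definitions that can be written out directly. The sketch below computes them in plain Python on made-up labels and probabilities; the function names, the 0.5 classification threshold, and the example data are assumptions for illustration and do not reproduce the study's evaluation code.

```python
def auc(y_true, y_prob):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def threshold_metrics(y_true, y_prob, threshold=0.5):
    """Confusion-matrix metrics at a fixed threshold (assumed 0.5 here).

    Assumes each denominator is non-zero, i.e. both classes are present
    and both classes are predicted at least once.
    """
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 1)
    tn = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 0)
    fp = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "BA": (sensitivity + specificity) / 2,
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Made-up outcomes (1 = PMV) and predicted probabilities for six patients.
y_true = [1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

example_auc = auc(y_true, y_prob)
example_brier = brier_score(y_true, y_prob)
example_metrics = threshold_metrics(y_true, y_prob)
```

A lower Brier score indicates better calibration, whereas higher values are better for the AUC and the threshold-based metrics.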

Hyperparameters Optimization

Hyperparameters, such as the learning rate and the maximum depth, are essential for ML algorithms, and an ideal combination of hyperparameters can significantly improve model performance. Accordingly, we performed hyperparameter optimization (HPO) using Optuna version 2.10.0, an open-source optimization framework that allows complex deep learning experiments to be written quickly, efficiently, and dynamically; the optimization used the Hyperband method to test different combinations of hyperparameters [25]. We performed 100 trials to obtain the best combination of hyperparameters.

Model Fitting and Evaluation

The final model was fitted with the best combination of hyperparameters. The data from the MIMIC-IV database were randomly divided into a training set (80%) and an internal validation set (20%), and the data from the eICU database were used for external validation. To further evaluate the final model, its prediction performance was compared with that of the sequential organ failure assessment (SOFA) score [26], the simplified acute physiology score (SAPS-II) [27], the logistic organ dysfunction system (LODS) score [28], and a combination of commonly used blood gas indicators, including arterial partial pressure of oxygen (PaO2), arterial partial pressure of carbon dioxide (PaCO2), arterial oxygen saturation (SO2), the ratio of PaO2 to the fraction of inspired oxygen (PFR), pH, base excess (BE), bicarbonate, and lactate. Because PMV is strongly associated with mortality, the final model was also used to predict hospital mortality as a further assessment of its prediction performance. Finally, a nomogram was constructed to visually present the prediction result. Recursive partitioning analysis was performed with the R package ‘rpart’ to construct a decision tree for risk stratification based on each participant's total nomogram points. Moreover, the prediction for every patient was plotted in order of risk to assess the distribution of the model's predictions.
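
The random 80%/20% division can be sketched as follows in plain Python. The function name, the fixed seed, and the use of integer patient identifiers are assumptions made for illustration; the study does not report how the split was implemented or whether it was stratified by outcome.

```python
import random


def split_train_validation(ids, train_fraction=0.8, seed=42):
    """Randomly split identifiers into training and internal validation sets.

    The seed (an assumed value) only makes the example reproducible.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers for a cohort of 1000 patients.
patient_ids = list(range(1000))
train_ids, valid_ids = split_train_validation(patient_ids)
```

Splitting on patient identifiers rather than on individual records keeps all data from one patient in the same set, which avoids leakage between the training and validation sets.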

Statistical Analyses

The Kolmogorov–Smirnov test was used to assess whether the data were normally distributed. Continuous variables were expressed as mean ± standard deviation (SD) and compared using the t-test. Levene’s test was used to assess the assumption of homogeneity of variance. If this assumption was not satisfied, Welch’s t-test was used for comparisons between groups. Categorical data were expressed as proportions and compared using the chi-squared test. Additionally, the variance inflation factor (VIF) was assessed among the covariates in the nomogram, and VIF > 4.0 was interpreted as indicating multicollinearity. Variables with VIF > 4.0 were not included in the final model analysis. Univariate and multivariate logistic regression analyses were performed for all variables of the nomogram to further identify the independent risk factors.
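
For reference, Welch's t-test differs from the standard t-test in that it does not pool the group variances and uses the Welch-Satterthwaite approximation for the degrees of freedom. The sketch below computes both quantities in plain Python on made-up data; the function name is an assumption, and the p-value (which requires the t distribution) is deliberately left out.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1 = statistics.variance(a) / n1  # squared standard error of the mean, group 1
    v2 = statistics.variance(b) / n2  # squared standard error of the mean, group 2
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Made-up values for two groups with clearly unequal variances.
t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

When the two groups have equal sizes and equal variances, the degrees of freedom approach those of the standard t-test; with unequal variances they are smaller, which makes the test more conservative.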

A logic check of the data was conducted to identify extreme and erroneous values, which were replaced with mean values. Variables with > 30% missing values were excluded. Missing values of the included variables were filled by multiple imputation [29]. All statistical analyses were conducted using Python version 3.9.0 (Python Software Foundation, www.python.org) and R software version 4.0.4 (R Foundation for Statistical Computing, Vienna, Austria). Two-tailed tests were performed, and a P-value < 0.05 was considered statistically significant.

Results

Baseline Characteristics

A total of 4533 mechanically ventilated patients with CHF from the MIMIC-IV database were included, and all participants were divided into PMV (n = 1268) and non-PMV (n = 3265) groups based on the duration of MV (Supplementary Fig. S1). In addition, 65 features, including demographics and characteristics, vital signs, therapy, comorbidities, and laboratory tests, were selected for the initial analysis (Supplementary Table S1). After feature elimination through LASSO regression, 10 key variables, including Glasgow coma scale (GCS) score, PFR, cardiac troponin T (cTnT), N-terminal pro-B type natriuretic peptide (NT-proBNP), diagnosis of pneumonia, sepsis, AKI, anemia, use of inotropic agents, and urine output (UO), were used to establish the compact model and to perform external validation in the eICU database. The variable comparisons between the non-PMV and PMV groups of the MIMIC-IV and eICU databases are shown in Table 1.

Table 1 Baseline characteristics of the cohorts

Feature Selection

LASSO regression was performed to identify the key features. The process of feature selection through LASSO regression is shown in Supplementary Fig. S2. As the penalty coefficient (lambda) increased, the coefficients of some variables gradually approached zero, which indicated that these variables had little influence on the model. After LASSO regression, 10 key features were selected. The feature importance of the 10 variables was evaluated by SHAP values (Fig. 1). To identify multicollinearity among the 10 features, the VIF test was performed; it showed no significant multicollinearity, with all VIFs below 4.0 and a mean VIF of 1.07. Univariate and multivariate logistic regression analyses were conducted to quantitatively evaluate the influence of the 10 variables on the model. The results suggested that all 10 features had a significant influence on PMV in patients with CHF (Table 2).

Fig. 1

Feature importance assessed by SHAP value. (A) The blue to red color represents the feature value (red high, blue low). The x-axis measures the impact on the model output (right positive, left negative); (B) Significance of the predictors in the CatBoost model

Table 2 Univariate and multivariate logistic regression analyses for prolonged acute mechanical ventilation

Development of the CatBoost Model

Based on the 10 key features, 12 commonly used ML algorithms without parameter optimization were used for the initial analysis; the CatBoost algorithm showed the best prediction performance (AUC = 0.790) (Fig. 2). The AUC, accuracy, PPV, NPV, BA, and F1-score of the different models based on the 10 features in the internal validation set are summarized in Supplementary Table 3. As shown in the table, the CatBoost model had the best AUC (0.790), NPV (82.0%), BA (67.9%), and F1-score (0.532). To assess the calibration of the 12 ML models, calibration curves were plotted (Fig. 3A); for simplicity, only the LR, XGBoost, and KNN models are shown as references. The predicted probability of the CatBoost model was the closest to the true probability, with a Brier score of 0.154. The DCA also showed a satisfactory result (Fig. 3B). Accordingly, the CatBoost model was used for the remaining analyses.

Fig. 2

Prediction performance of different machine learning algorithms

Fig. 3

Model evaluation and validation. (A) Calibration plots; (B) Decision curve analysis; (C) Comparisons of the CatBoost and other models for prolonged acute mechanical ventilation (PAMV) based on the compact 11 features; (D) Prediction of the CatBoost model for predicting PAMV, short-term prolonged mechanical ventilation (PMV) and long-term PMV; (E) Prediction of in-hospital death based on the CatBoost model; (F) External validation of the prediction model based on the eICU dataset

HPO based on the 10 features was conducted to improve the prediction performance of the CatBoost model (Supplementary Fig. S3). After 100 trials, the best combination of hyperparameters with the best AUC was obtained. The final settings of the hyperparameter search are listed in Supplementary Table S3. Additionally, comparisons of the full model with the compact model and the final model after HPO were conducted to demonstrate the optimization effect. As shown in Supplementary Fig. S4, the prediction performance of the CatBoost model was significantly improved (AUC: 0.790 vs. 0.817).

Model Evaluation and Validation

To further demonstrate the prediction performance of the CatBoost model, the final model was compared with the SOFA score, SAPS-II, blood gas analysis, and the LODS score for predicting PAMV in patients with CHF (Fig. 3C). As shown in the figure, the CatBoost model had the best prediction performance. Moreover, we used the CatBoost model to predict short-term PMV (≥ 7 days), long-term PMV (≥ 14 days), and in-hospital death, and favorable results were obtained (Fig. 3D-E).

External validation was conducted in this study. As shown in Fig. 3F, in the external validation set with the 10 features, the CatBoost model had the best prediction performance among the 12 algorithms (AUC = 0.780). The accuracy of the CatBoost model in the external validation set was also the highest among the models, with an accuracy of 69.4%.

Finally, a nomogram was constructed to visually represent the prediction result. The PAMV probability of each patient in the external validation set was obtained using the ‘predict_proba’ function of the CatBoost algorithm, and the patients were sorted by predicted probability. The prediction distribution plot of the CatBoost model, with patients sorted in order of risk, showed positive clustering of patients who suffered from PAMV, suggesting satisfactory discriminatory ability of the model (Fig. 4A). Furthermore, a decision tree based on each patient's total nomogram points was employed to optimize the risk stratification (Fig. 4B). All patients were divided into three groups: low-risk (total points < 186), middle-risk (total points ≥ 186 and < 228), and high-risk (total points ≥ 228) (Fig. 4C). Compared with the low-risk group as the reference, patients in the middle- and high-risk groups had a 3.25-fold and a 9.26-fold risk of PAMV, respectively (Fig. 4D). The nomograms for predicting short-term and long-term PMV are shown in Supplementary Figs. S5 and S6.
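
The three risk groups can be assigned from the total nomogram points with a simple rule. The sketch below uses the cut-offs reported above; the function name `risk_group` is an illustrative assumption, and computing the total points themselves requires the nomogram, which is not reproduced here.

```python
def risk_group(total_points: float) -> str:
    """Map total nomogram points to the reported PAMV risk group."""
    if total_points < 186:
        return "low"
    if total_points < 228:
        return "middle"
    return "high"


# Hypothetical total points for three patients.
groups = [risk_group(points) for points in (150, 200, 240)]
```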

Fig. 4

Model validation and risk stratification. (A) Prediction distributions of the risk of PAMV; (B) The risk stratification of PAMV according to a decision tree; (C) A nomogram of the CatBoost model for predicting PAMV; (D) Logistic analysis of the risk of PAMV based on the risk stratification

Discussion

A prediction model of PMV was established based on 12 commonly used ML algorithms, among which CatBoost showed the best prediction performance. In addition, we found that several factors, including pneumonia, GCS score, and NT-proBNP, were strongly associated with PMV in patients with CHF. The performance of the CatBoost model in predicting PMV was significantly better than that of traditional methods, including the SOFA score, SAPS-II, the LODS score, and blood gas analysis. Moreover, the CatBoost model showed satisfactory prediction performance in both the internal and the external validation sets. Additionally, hospital mortality could also be accurately predicted using the final model.

The CatBoost algorithm, which belongs to the gradient boosting algorithm family, outperforms other publicly available boosting implementations in terms of quality on a variety of datasets [30]. The two main innovations introduced in CatBoost are ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both can overcome the prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms [31]. In addition, CatBoost uses a new schema to calculate leaf values when selecting the tree structure. The schema helps to reduce overfitting, which constrains the generalization ability of machine-learning models [31]. In the present study, the CatBoost model also showed satisfactory performance.

Pulmonary venous and capillary congestion is an important pathophysiological feature in patients with CHF, and increased PCWP can lead to interstitial and alveolar edema and pleural effusion, causing acute respiratory failure [32]. Accordingly, MV is commonly used in patients with CHF to relieve dyspnea. Inflammation of the pulmonary parenchyma and interstitium caused by pneumonia can aggravate lung consolidation and reduce the effective respiratory area [33], thereby exacerbating respiratory failure in patients with CHF. In addition, sepsis can aggravate hypoxemia through several mechanisms, including overactivation of the immune response, circulatory abnormalities, and endothelial dysfunction of the lung [34]. Accordingly, patients with CHF combined with pneumonia or sepsis are likely to require PMV.

NT-proBNP, which is a cleaved proteolytic product, has been shown to be an efficient marker for evaluating prognosis and cardiac dysfunction in patients with CHF [35]. Additionally, NT-proBNP can reflect the degree of pulmonary congestion [36]; therefore, NT-proBNP may be associated with respiratory failure in patients with CHF. Moreover, cTnT, which is increased as a result of either ischemic or non-ischemic myocardial injury, is strongly associated with cardiac dysfunction [37]. CHF reduces renal perfusion and thereby causes AKI, and renal dysfunction, in turn, increases sodium and water reabsorption, placing a heavy burden on the heart. Therefore, renal injury worsens cardiac dysfunction in patients with CHF. As expected, our findings suggested that NT-proBNP, cTnT, UO, and AKI were strongly associated with PMV. In addition, the use of inotropic agents and the GCS score were also identified as predictive factors for PMV in this study, and these results may be helpful in future research.

Blood gas analysis, the SOFA score, the LODS score, and SAPS-II are widely used to evaluate respiratory status and disease severity. We compared the CatBoost model with these methods for predicting PMV and found that the CatBoost model was significantly superior. A previous study demonstrated that PMV was significantly associated with mortality [13]. Accordingly, we hypothesized that hospital mortality could be accurately predicted using the final model developed to predict PMV. The results showed that the CatBoost model exhibited favorable performance in predicting in-hospital mortality based on the selected 10 features (AUC = 0.844). In addition, external validation was performed in the present study, and the prediction performance of the CatBoost model was also excellent in the eICU database.

The decision tree algorithm based on the nomogram was used to optimize the risk stratification of PMV in CHF patients (Fig. 4C). Because the 10 predictive features are readily accessible and frequently monitored in routine clinical practice, CHF patients at high risk of PMV could be easily identified using this nomogram. Our prediction model may lead to several changes in the management of mechanically ventilated patients with CHF, such as the prediction of ICU workload and the subsequent cost incurred by the hospital, and the ability to rationalize and tailor the mechanical ventilation strategy for high-risk patients. Moreover, the CatBoost model may facilitate decision making regarding the timing of tracheostomy in patients at high risk of PMV. Timely tracheostomy has been associated not only with a shorter duration of MV and a shorter intensive care unit length of stay but also with markedly lower hospital mortality [38]. Accordingly, we suggest that our model has clinical application value in the early identification of PMV in patients with CHF and in the administration of appropriate preventive treatments.

Limitations

Several limitations of this study should be considered. First, the data for analysis were collected from two public databases. Some important variables with a high rate of missing values, including left ventricular ejection fraction, were excluded, which may affect the prediction performance. Second, there is no consensus definition of PMV in CHF patients, and other investigators have used a wide range of time frames, from 4 to 21 days, to define PMV. Considering that progress in the management of CHF allows decompensated symptoms to be corrected more rapidly than before, we defined PMV as MV ≥ 4 days in this study. Third, this was a retrospective cohort study, and further prospective studies are needed to confirm the findings. Accordingly, the results should be interpreted with caution.

Conclusion

A prediction model of PMV with satisfactory prediction performance was established based on 12 commonly used ML algorithms. The CatBoost model was better than other predictive methods in predicting PMV in patients with CHF. Additionally, the CatBoost model may help in predicting in-hospital mortality in these patients. These findings need to be confirmed in future prospective studies.