1 Introduction

An ICU serves patients with severe complications or life-threatening injuries who require constant care to maintain normal bodily functions. To improve hospital services, it is important to identify early on which patients should be admitted to the ICU. In an ICU, the patient is monitored through Electronic Health Record (EHR) systems, which record a large volume of medical data every day, including physiological measurements. Fitting statistical models to these measurements has the potential to yield more accurate and earlier predictions of future clinical events. This might not only help clinicians make more effective medical decisions but also facilitate an economical allocation of hospital resources. Mortality and Length of Stay (LOS) prediction are concerned with two outcomes: whether the patient dies or survives, and how long the patient is likely to remain in the intensive care unit. Nevertheless, most available mortality and LOS prediction systems [1,2,3,4] in the literature were designed to use at least 24 hours of data to provide a real-time or retrospective prediction of patients’ mortality. To enable earlier prediction, the main objective of this paper is to develop an end-to-end approach based on deep learning models, within a data mining framework, specifically intended for predicting mortality and LOS from multivariate time-series physiological measurements collected in the first few hours of admission, in particular the first 6 h after a patient’s admission to the ICU. The rest of the paper is organized as follows: Sect. 2 reviews the state-of-the-art works. Section 3 details the process of dataset collection and preparation. Section 4 discusses the proposed model and presents its configuration and implementation tools. Section 5 reports the experiments that assess the effectiveness of the proposed method. Finally, Sect. 6 concludes the paper and highlights the main contributions.

2 Related Works

Over the past few decades, substantial research has been undertaken on predicting mortality risk and LOS. Some of the more frequently used mortality prediction models in an ICU setting include SAPS-II [1] and SOFA [2]. SAPS-II was designed to estimate the probability of mortality, while SOFA was used to describe organ dysfunction. Using the first 24 hours of patient physiological measurements, these scores are designed to make only one prediction. As a result, it is unknown how well each system predicts mortality after the first day of admission. Moreover, it seems intuitively likely that simple clinical judgment will also discriminate more effectively as time passes. Existing tools are therefore slow to reach useful discriminatory effectiveness and are generally not felt by clinicians to be useful for decision-making by the time they can discriminate.

In addition to severity scores, several authors have focused on mortality risk management. For example, Pirracchio et al. [4] aimed to develop a scoring procedure to predict mortality in ICUs based on the Super-Learner (SL) model. They showed that the SL method improved performance. However, the authors evaluated the performance of SL using data recorded within the first 24 hours. Moreover, Darabi et al. [5] developed a model based on Gradient Boosted Tree (GBT) and Convolutional Neural Network (CNN) to estimate the mortality risk of patients admitted to ICUs. Their results show that a smaller number of features can generate satisfactory outcomes for GBT, unlike CNN, which needs a large amount of data for training. However, their model was designed for the 30-day period after admission.

In addition to mortality risk prediction, a few researchers have focused on estimating LOS. Gentimis et al. [6] explored the use of a Neural Network (NN) for predicting the total LOS of a patient in the hospital. Their predictive model outperformed the machine learning models it was compared against. However, the studied scenarios considered time-frames of \(> 5\) days or \(\le 5\) days to validate the model. Furthermore, Zebin et al. [7] applied an Auto-Encoder (AE) together with a dense neural network to identify short and long stays for patients. The proposed model improved performance compared with a simple dense neural network for the classification task. However, their results were validated using recordings observed after 24 h of admission.

To conclude, all the above-mentioned works focused only on predicting the risk of mortality and LOS for patients who required intensive care using at least 24 hours of data from their ICU admission [3,4,5,8,9]. The first challenge, therefore, lies in the early hours of a patient’s admission, for instance the first 6 and 12 h. Additionally, not all critically ill patients can benefit from ICU admission. Hence, prioritizing patients’ treatment by the severity of their condition is crucial because the ICU is extremely costly and has limited resources. The second challenge, therefore, lies in triaging patients according to their medical conditions while estimating their expected length of hospitalization. In addition, most research has evaluated predictive models on univariate time-series data and did not consider the potential of multivariate time-series records for improving the accuracy and efficiency of time-series modeling [10].

3 Dataset

This work is conducted on the well-known, publicly available, large-scale ICU database MIMIC-III [11], a single-center electronic database developed by the MIT Lab for Computational Physiology, comprising health data related to 61,532 ICU admissions of 46,520 distinct de-identified patients admitted between 2001 and 2012.

3.1 Feature Engineering

Every day, different vital sign measurements are recorded and analyzed during intensive care stays. In this work, we focused primarily on hidden patterns within ICU time-series data and investigated the hypothesis that motifs within these data carry useful knowledge that can help improve clinical prediction tasks. This hypothesis is motivated by observations reported in several studies; for example, [12] reports that, when oxygen transport is impaired, the measurements of the associated variables in this time-frame indicate an increased risk of death. We therefore explored some temporal variables defined in acuity severity scoring systems and added others that have been shown to have a strong effect in predicting mortality and hence LOS [13]. These variables include “heart rate”, “systolic BP”, “diastolic BP”, “mean BP”, “respiratory rate”, “oxygen saturation”, “glasgow coma score”, “blood urea nitrogen”, “temperature”, “white blood cells”, and last but not least “bilirubin”.

Some of the most pertinent measurements may be obtained using information available in the earliest phase [3]. We therefore extracted features for the first 6 hours of every ICU stay. We also extracted features for the first 12 and 24 h in order to verify that the proposed model maintains its accuracy over longer periods.

3.2 Feature Preprocessing

EHRs contain valuable information for estimating mortality risk and discharge time for ICU patients, but substantial missing data and class imbalance both pose problems for the development and implementation of a prediction model. Hence, the following two issues were identified and handled accordingly.

Missing Data Imputation. For certain features, more than 50% of the values are missing. To manage this problem, data imputation was performed in two steps. First, we fill missing values using linear interpolation on each multivariate time series. Some observations remain missing after this step because certain variables lack the recorded values needed for interpolation. Second, we impute these remaining observations with the mean.
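The two imputation steps can be sketched as follows. This is a minimal illustration rather than the pipeline used in the paper: the function name `impute_stay`, the use of NaN as the missing-value marker, the per-feature means passed in as `feature_means`, and the edge handling of `np.interp` (which holds the first and last observed values constant) are all assumptions.

```python
import numpy as np

def impute_stay(series, feature_means):
    """Impute one ICU stay of shape (time_steps, n_features); NaN marks a missing value.

    feature_means is a hypothetical vector of per-feature means
    (for instance computed on the training data).
    """
    filled = np.array(series, dtype=float)
    t = np.arange(filled.shape[0])
    for j in range(filled.shape[1]):
        col = filled[:, j]
        observed = ~np.isnan(col)
        if observed.any():
            # Step 1: linear interpolation between the observed points.
            # Assumption: np.interp repeats the first/last observed value at the edges.
            filled[:, j] = np.interp(t, t[observed], col[observed])
        else:
            # Step 2: the variable was never recorded in this stay, fall back to the mean.
            filled[:, j] = feature_means[j]
    return filled

# Toy stay: 3 time steps, 2 variables (heart rate partly missing, temperature never recorded).
stay = np.array([[80.0, np.nan],
                 [np.nan, np.nan],
                 [90.0, np.nan]])
imputed = impute_stay(stay, feature_means=np.array([85.0, 37.0]))
```

In this toy stay, the missing heart-rate value is interpolated from its two neighbours, while the temperature column is filled entirely with the supplied mean.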

Imbalanced Data Regulation. The number of patients who died in the ICU is relatively small compared with the number who survived, yielding an imbalanced dataset. To manage this problem, re-sampling methods were adopted since they are less sensitive to outliers than other techniques such as cost-sensitive classifiers [14] and automatic support vector data description [15]. Two of the most common categories of re-sampling methods are under-sampling and over-sampling. The former removes observations belonging to the majority class from the training dataset, while the latter duplicates samples belonging to the minority class, thus increasing its impact within the training process. We applied the former, since the latter can cause overfitting and thus make models less flexible during training. As a result, the size of the data was reduced from 33.6 MB to 7.76 MB, from 66.7 MB to 15 MB, and from 129 MB to 29 MB, for the 6-hour, 12-hour and 24-hour time-frames, respectively.
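As an illustration of under-sampling, the sketch below randomly drops majority-class samples until every class has as many samples as the smallest one. The text does not state the exact sampling rule or target ratio, so the random selection, the 1:1 ratio, and the name `random_undersample` are assumptions.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep an equal number of samples per class by discarding majority-class rows."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()  # size of the minority class
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original ordering of the retained rows
    return X[keep], y[keep]

# Toy data: 90 survivors (class 0) and 10 non-survivors (class 1).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = random_undersample(X, y)
```

Under-sampling should be applied to the training data only, so that the validation and test sets keep the original class distribution.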

4 Methodology

The idea behind time-series prediction is to predict future events based on past values, with reference to historical measurements and associated patterns. In terms of research methodology, we want to retain relevant information throughout the processing of medical data sequences, as physiological variables decrease or increase over a period of time, thus making it possible to predict future outcomes associated with patient conditions in care units. To reach these goals, a two-stage architecture is presented in Fig. 1.

Fig. 1. A summary structure of the two-stage architecture: in the first stage, a binary classifier is trained to predict mortality. Then, if the mortality is predicted to be positive, the model further provides an estimation of LOS.

The rationale behind this architecture is as follows. We start from a multivariate time series of 11 clinical variables recorded for each patient \(P_i\):

$$\begin{aligned} P_i: X_{1,t_{k}}, X_{2,t_{k}}, ..., X_{11,t_{k}} \end{aligned}$$
(1)

with \(k = 1, \ldots , n\) and \(n \in \) {6-hour, 12-hour, 24-hour}. In the first stage, a binary classifier is trained to predict the risk of mortality. Formally, we define:

$$\begin{aligned} Class = {\left\{ \begin{array}{ll} 0, &{} \text {survivors group}\\ 1, &{} \text {non-survivors group}. \end{array}\right. } \end{aligned}$$

For this stage, we build a dataset using two exclusion criteria: we first keep patients with \( 16\le \) age \(\le 89\) [8]. Then, we exclude ICU stays of less than one hour to remove the ambiguity caused by unusually short stays. After filtering, we obtain 49,632 ICU stays of 36,343 patients. In the second stage, a multi-class classifier is trained on the same vital signs to predict LOS for those predicted to die in stage 1. Accordingly, we exclude the ICU stays with death time \(\le 0\). As a result, 5,718 in-hospital mortalities were obtained. We then assign each record to one of the four classes below:

$$\begin{aligned} Class = {\left\{ \begin{array}{ll} 0, &{} \text {if } death\_time\_hours< 6, \\ 1, &{} \text {if } 6 \le death\_time\_hours< 12, \\ 2, &{} \text {if } 12 \le death\_time\_hours < 24, \\ 3, &{} \text {otherwise}. \end{array}\right. }\end{aligned}$$
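The four-class labeling above can be expressed compactly with bin edges at 6, 12 and 24 hours. The following is a small sketch; the function name `los_class` is hypothetical, and it assumes that stays with a non-positive death time have already been removed.

```python
import numpy as np

def los_class(death_time_hours):
    """Map death time in hours to classes 0..3 using the edges 6, 12 and 24.

    np.digitize with right=False returns i such that bins[i-1] <= x < bins[i],
    which matches the half-open intervals of the class definition.
    """
    return np.digitize(death_time_hours, bins=[6, 12, 24])

labels = los_class(np.array([0.5, 6.0, 11.9, 12.0, 24.0, 100.0]))
print(labels)  # [0 1 1 2 3 3]
```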

The proposed model predicts outcome values by identifying short-term (6 h/12 h) and long-term (24 h) dependencies. For this purpose, we employed the LSTM architecture [19]. This type of network extends the simple Multilayer Perceptron (MLP) network by producing an output that depends on previously learned information. The LSTM architecture is characterized by hidden units called memory blocks, which allow the network to remember information over short and long sequences. Moreover, the gates inside these blocks allow the LSTM model to overcome the issues that inhibit the training of other deep models, including RNNs and MLPs. This, together with the impressive results that can be achieved, is the reason for its popularity on a wide variety of problems [16, 17].
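To make the role of the memory block and its gates concrete, here is a minimal NumPy sketch of one step of a standard LSTM cell (input, forget and output gates plus a candidate state). It illustrates the generic formulation, not the trained model of this paper: the hidden size of 8, the random weights, and the assumption of one observation per hour over a 6-hour window are all chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*H, D), U: (4*H, H), b: (4*H,), with gates stacked in the
    order [input, forget, output, candidate].
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate cell state
    c_t = f * c_prev + i * g       # updated memory
    h_t = o * np.tanh(c_t)         # hidden state passed to the next step
    return h_t, c_t

# Run the cell over a toy sequence: 6 time steps of 11 clinical variables.
rng = np.random.default_rng(0)
D, H, T = 11, 8, 6
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(T, D)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The additive update of `c_t` is what lets gradients flow across many time steps, which is the property referred to above when comparing LSTMs with plain RNNs.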

4.1 Model Configuration

The efficient implementation of deep learning requires the selection and optimization of many hyperparameters, as well as extensive trial and error to find the optimal values. To assess performance, the data is divided into training, validation and test sets: the training set is used to train the classifier; the validation set is used to fine-tune the parameters and estimate the behavior of the classifier; and the test set is used to determine the efficiency of the classifier. Once the data is split, we tune models using K-fold cross-validation with K = 3. The implemented LSTM model uses the Tanh activation function in the hidden layers and the Sigmoid activation function in the output layer. Dropout with a rate of 0.2 is used as a regularization technique. The learning rate is \(10^{-3}\), the number of training epochs is 60 and the batch size is 100.
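The configuration described above can be collected in one place and combined with a 3-fold split, as sketched below. The values in `CONFIG` are the ones stated in the text; the dictionary keys, the helper `k_fold_indices`, the shuffling, and the seed are hypothetical choices made for illustration.

```python
import numpy as np

# Hyperparameters as reported in the text (key names are illustrative).
CONFIG = {
    "hidden_activation": "tanh",
    "output_activation": "sigmoid",
    "dropout": 0.2,
    "learning_rate": 1e-3,
    "epochs": 60,
    "batch_size": 100,
    "k_folds": 3,
}

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle once, then cut into k folds
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=CONFIG["k_folds"]))
```

Each sample appears in exactly one validation fold, so every candidate hyperparameter setting is scored on all of the tuning data once.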

4.2 Model Implementation

In this work, the model was implemented using the Keras framework with a TensorFlow backend. The implementation consists of two stages:

  1. Feature Engineering: we chose a big data tool, Apache Hive 2.1.0, on a Microsoft Azure remote cluster (2 head nodes and 1 worker node, each with 200 GB space, 14 GB RAM, and 4 processors), to perform data preprocessing and feature engineering.

  2. Deep Learning using Colaboratory.

We also used Python and several packages for efficient model testing, hyperparameter tuning and model evaluation including: Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn.

5 Experimental Results

In this section, we describe the results of our experiments by evaluating the LSTM model against the traditional state-of-the-art acuity scores and machine learning approaches that have been used to predict future clinical events based on time-series measurements, including the SOFA score, the SAPS-II score, SL, SVM, LR, NB and CNN. Individual sets of parameters were tuned using 3-fold cross-validation to evaluate the performance of each model. Experiments were conducted under three settings: using temporal physiological measures within 6-hour, 12-hour, and 24-hour time-frames. It is worth noting that the SAPS-II and SOFA acuity scores use the first 24 h of data to evaluate patient severity of illness.

For the binary classifier, we use the F1-score and the Matthews Correlation Coefficient (MCC) to evaluate the effectiveness of the model. These two metrics were chosen because they provide a more realistic measure of a model’s performance and are therefore robust for binary classification problems [18].
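Both metrics follow directly from the confusion matrix, as the sketch below shows using their standard definitions. The function name `f1_and_mcc` and the convention of returning 0 when a denominator vanishes are assumptions made for this example.

```python
import math
import numpy as np

def f1_and_mcc(y_true, y_pred):
    """Compute F1-score and MCC for binary labels (1 = non-survivor, 0 = survivor)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    # F1 = 2TP / (2TP + FP + FN): harmonic mean of precision and recall.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return f1, mcc

f1, mcc = f1_and_mcc([1, 1, 1, 0, 0, 0, 0, 0],
                     [1, 1, 0, 1, 0, 0, 0, 0])
```

Unlike F1, MCC also uses the true negatives, so it reflects performance on both the survivor and non-survivor groups.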

The outputs of the different classifiers are presented in Table 1. According to these results, fitting an LSTM model on the multivariate time-series records within a 6-hour time-frame improved the early prediction of mortality risk for patients in intensive care. Table 1 shows that the LSTM model under the tuned configuration achieves a higher F1-score and MCC than the other mortality prediction approaches, which indicates that the performance of the LSTM model is more consistent. Although the CNN model attained a better F1-score and MCC within the 24-hour time-frame, the LSTM model outperformed it within the 6-hour and 12-hour time-frames, supporting its ability to predict mortality risk as early as possible after patients are admitted to critical care units.

Table 1. Mortality prediction performance for binary-classification approaches (The best performing model is highlighted in bold).

Regarding multi-class classification, averaging the evaluation measures provides a view of the overall performance of the LSTM fitted on data aggregated over 6 hours for LOS prediction, compared with data aggregated over the 12-hour and 24-hour time-frames. Two common ways of averaging results are the micro-average and the macro-average. A macro-average computes the metric independently for each class and then takes the average, whereas a micro-average aggregates the contributions of all classes to compute the average metric. Figure 2 summarizes the micro- and macro-average results for the AUROC metric and confirms that multivariate time-series data aggregated over a 6-hour time-frame offer solid multi-class results, while the 12-hour and 24-hour time-frames bring only a slight improvement.
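The difference between the two averages can be illustrated with a small one-vs-rest AUROC computation. This sketch is not the evaluation code of the paper: the helper names are hypothetical, the AUROC is computed by direct pairwise comparison (adequate only for small examples), and every class is assumed to occur at least once in `y_true`.

```python
import numpy as np

def binary_auroc(labels, scores):
    """AUROC as the probability that a positive is scored above a negative (ties count 0.5)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def micro_macro_auroc(y_true, y_score, n_classes):
    """y_true: class indices, y_score: (n_samples, n_classes) predicted scores."""
    y_score = np.asarray(y_score, dtype=float)
    onehot = np.eye(n_classes, dtype=int)[np.asarray(y_true)]
    # Macro: one-vs-rest AUROC per class, then an unweighted mean over classes.
    macro = float(np.mean([binary_auroc(onehot[:, c], y_score[:, c])
                           for c in range(n_classes)]))
    # Micro: pool every (sample, class) decision and compute a single AUROC.
    micro = binary_auroc(onehot.ravel(), y_score.ravel())
    return micro, macro

micro, macro = micro_macro_auroc(
    y_true=[0, 0, 0, 1],
    y_score=[[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.3, 0.7]],
    n_classes=2,
)
```

In this toy case each class is perfectly ranked on its own, so the macro-average is 1.0, while the pooled micro-average is lower because scores are compared across classes.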

Fig. 2. ROC curves of the LSTM model fitted on data aggregated over 6-hour (left), 12-hour (middle), and 24-hour (right) time-frames, applied to the multi-class classification problem.

6 Conclusion and Future Works

Improving the quality of care for patients and predicting future outcomes are among the most important targets in critical care research. In this paper, using multivariate time-series data obtained from the MIMIC-III EHR database, we show that the LSTM model consistently outperforms all competing mortality prediction models when using physiological measures observed during the first 6 and 12 h. These positive results suggest that access to the patient’s physiological data trajectory as early as possible could enhance the monitoring and prediction of future events concerning patients’ conditions in ICUs. In future work, we plan to apply the proposed model to other clinical tasks including early triage and risk assessment, prediction of physiologic decompensation, and identification of high-cost patients.