
1 Introduction

Successful modeling of complex multivariate event time series, and the ability to predict future events from them, is important for applications across science, engineering, and business. In clinical settings, the ability to predict future events for a patient from clinical events observed in the past, such as past medication orders, past labs and their results, or past physiological signals, can help us anticipate a wide range of future events, letting health care practitioners intervene ahead of time or prepare resources for their occurrence. This, in turn, can improve the quality of patient care.

One of the challenges of modeling clinical event time series is their complexity: clinical event time series for hospitalized patients may consist of thousands of different event types, corresponding to administrations of many different medications, lab orders, arrivals of lab results, various physiological observations, and so on. This complexity is not captured well by standard Markov time series models [19] with either observed or hidden state and transition models.

To alleviate the event complexity problem, we propose a new, more scalable event time series model based on the long short-term memory (LSTM) network [14] that relies on two sources of information to predict future events. One source is derived from the set of recently observed clinical events. The other is based on the hidden state space defined by the LSTM, which aims to abstract more distant past patient information that is predictive of future events. In the context of Markov state models, the next state in our model, and the transition to it, is defined by a combination of the recent state (most recent events) and the hidden state summarizing more distant past events.

To evaluate the proposed model, we use data derived from electronic health records (EHRs) of critical care patients in the MIMIC-III dataset [16]. The clinical events considered in this work span multiple event types, including medication administration events, lab test result events, physiological result events, and procedure events. These are combined in the dynamically changing environment typical of intensive care units (ICUs), where patients suffer from severe, life-threatening conditions.

Through extensive experiments on MIMIC-III data, we show that our model outperforms multiple time series baselines in terms of the quality of event predictions. To provide further insight into its prediction performance, we also break the results down by the different types of clinical events considered (medication, lab, procedure, and physiological events), as well as by their repetition patterns, again showing the superior performance of our proposed model.

2 Related Work

2.1 Event-Time Series Models

The majority of discrete time series models are based on Markov processes [24, 25]. Markov process models rely on the Markov property, which assumes that the state captures all the information relating the past to the future. In other words, the next state depends only on the most recent state and is independent of earlier states. In this case, the joint distribution of an observed sequence is modeled as a chain of conditional probabilities: \(p(y_1, y_2, \ldots, y_T) = p(y_1) \prod _{t=2}^{T} p(y_t|y_{t-1})\)

For Markov process models, the conditional probability defining a transition is parameterized by an \(e \times e\) transition matrix, where e denotes the number of possible states: \(A_{i,j} = p(y_{t}=j | y_{t-1}=i)\). Standard Markov processes assume all states of the time series are directly observed. However, the states of many real-world processes are not directly observable. One way to resolve this problem is to define the state in terms of a limited number of past observations or features defined on past observations [11, 12, 31].
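As an illustration, a minimal sketch (not from the paper; all names are ours) of the maximum-likelihood estimate of such a transition matrix from fully observed state sequences:

```python
import numpy as np

def estimate_transition_matrix(sequences, num_states):
    """Maximum-likelihood estimate of the Markov transition matrix A,
    where A[i, j] = p(y_t = j | y_{t-1} = i), from observed state sequences."""
    counts = np.zeros((num_states, num_states))
    for seq in sequences:
        for prev, curr in zip(seq[:-1], seq[1:]):
            counts[prev, curr] += 1
    # Normalize each row into conditional probabilities; rows with no
    # observed transitions are left as zeros.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Example: two short sequences over a 3-state space.
A = estimate_transition_matrix([[0, 1, 2, 1], [1, 1, 0]], num_states=3)
```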

Hidden State Models. Another option is to use hidden Markov models (HMMs) [29], which introduce hidden states \(z_{t}\) of some dimension d. The observations \(y_t\) are then defined in terms of the hidden states through a \(d \times e\) emission table B with components \(B_{i,j} = p(y_{t}=j | z_{t}=i)\). Briefly, the transition table A is used to update the hidden states and the emission table B is used to generate observations.

HMMs have been shown to perform well in many applications, such as stock price prediction [10], DNA sequence analysis [15], and time series clustering [28]. However, the classic HMM comes with drawbacks when applied to real-world time series: the hidden state space is discrete, and the transition model is restricted to transitions between these discrete states. Linear dynamical systems (LDS) [17] remedy some of these limitations by defining a real-valued hidden state space with linear transitions between the current and next hidden states. One problem with HMM and LDS models is that the dimensionality of their hidden state space is not known a priori. Various methods for hidden state space regularization, such as the work by Liu and Hauskrecht [21, 22] for LDS, have been able to address this problem.

Continuous Time Models. We note that, in addition to discrete time series models, researchers have also explored methods permitting continuous-time modeling. Examples are various versions of Gaussian process models for predicting multivariate time series in continuous time, including those used for representing irregularly sampled clinical time series [20, 23].

Neural-Based Models. Recent advances in neural architectures and their application to time series offer an end-to-end learning framework that is often more flexible than standard time series models. In neural-based approaches, discrete time series are typically modeled using recurrent neural networks (RNNs), which provide a more flexible modeling framework. Similarly to HMMs and LDS, an RNN uses a hidden state to abstract and carry information from past history, but with a more flexible hidden state defined by real-valued vectors and transition rules. At each time step, the hidden state is updated from the previous time step's hidden state and new information from the current input. Although the basic RNN suffers from vanishing and exploding gradient problems [13], its variants such as the long short-term memory (LSTM) unit [14] and the gated recurrent unit (GRU) [2] have allowed wide adoption in event time series modeling. They have been applied to time series prediction and modeling [1, 9], as well as to vision [8], speech [7], and language [30] problems.

2.2 Clinical Event Time-Series Modeling

Modeling and prediction of discrete event time series in healthcare have been influenced greatly by advances in neural architectures and deep learning. [3] used Skip-gram [26] to represent and predict the next visit in outpatient data. However, they evaluated their model at the level of the hospital visit, which can be too coarse a granularity for real-world clinical applications that require event-specific timing information. [4] modeled clinical time series with an RNN and an attention mechanism; however, that model can only perform binary classification at the whole-sequence level, whereas our model predicts fine-grained future events at each time step of a sequence. [6] also used neural network models to predict sequences of clinical events, but their patient pool was limited to patients with kidney failure and organ transplants. In contrast, our model is tested on general clinical time series not limited to a specific patient cohort and shows superior performance over the baselines.

3 Methodology

In this section, we first introduce state-space Markov and LSTM-based event time series models and then present our model combining the two.

State-Space Markov Event Prediction. Given an observed event sequence \(\mathbf {y} = y_1, y_2, \ldots ,y_T\), we can model \(\mathbf {y}\) by defining a Markov transition model relating the current event state \(y_t\) to the next event state \(y_{t+1}\). Here we assume the event space is formed by a multivariate binary vector reflecting the occurrence of many different events (encoded as 1) over some time window. One way to parameterize the transition between two consecutive event states is to use a transition matrix W with a bias vector b. Since we want to predict a multivariate binary vector, we use the sigmoid function \(\sigma (x)=\frac{1}{1+e^{-x}}\) as the output activation function:

$$\begin{aligned} \hat{y}_{t+1} = \sigma (W \cdot y_{t} + b) \end{aligned}$$
(1)
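For concreteness, a minimal PyTorch-style sketch of this state-space predictor (Eq. 1); the class and variable names are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class MarkovEventPredictor(nn.Module):
    """Sketch of the state-space Markov event predictor of Eq. 1."""

    def __init__(self, num_events):
        super().__init__()
        self.transition = nn.Linear(num_events, num_events)  # W and b of Eq. 1

    def forward(self, y_t):
        # y_t: (batch, num_events) multi-hot vector of the current events
        return torch.sigmoid(self.transition(y_t))           # predicted \hat{y}_{t+1}
```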

LSTM-Based Event Prediction. LSTM models have been used successfully to model time series with the help of a hidden state vector, which allows information from the more distant past to be summarized. At a high level, at each time step of a sequence, the LSTM receives the current (event) input and updates its hidden state. The hidden state then generates signals for the next hidden state, as well as predictions for the occurrence of events in the next time step.

In detail, at each time step t, the events in the input sequence, represented as a multi-hot vector \(m_{t}\), are mapped to a real-valued vector \(x_{t}\) through a linear embedding matrix \(W^{(emb)}\): \(x_{t} = W^{(emb)} \cdot m_{t}\). Then, given the embedded input \(x_{t}\) and the previous hidden state \(h_{t-1}\), the LSTM updates its hidden state \(h_{t}\) through the standard gating equations:

$$\begin{aligned} f_{t}&= \sigma (W^{(f)} \cdot [h_{t-1}, x_{t}] + b^{(f)}) \\ i_{t}&= \sigma (W^{(i)} \cdot [h_{t-1}, x_{t}] + b^{(i)}) \\ o_{t}&= \sigma (W^{(o)} \cdot [h_{t-1}, x_{t}] + b^{(o)}) \\ c_{t}&= f_{t} \otimes c_{t-1} + i_{t} \otimes \tanh (W^{(c)} \cdot [h_{t-1}, x_{t}] + b^{(c)}) \\ h_{t}&= o_{t} \otimes \tanh (c_{t}) \end{aligned}$$

Here \(f_{t}\), \(i_{t}\), and \(o_{t}\) are the forget, input, and output gates, \(c_{t}\) is the memory cell, and \(\otimes \) denotes element-wise multiplication. With these components, the hidden state update can be written compactly as:

$$\begin{aligned} h_{t} = \text {LSTM}(x_{t}, h_{t-1}) \end{aligned}$$

Predictions of future event occurrences are generated through a fully connected layer \(W^{(fc)}\) with a sigmoid output activation:

$$\begin{aligned} \hat{y}_{t+1} = \sigma (W^{(fc)} \cdot h_{t} + b^{(fc)}) \end{aligned}$$
(2)

This parameterization is closely related to the state-space event predictor: replacing \(y_t\) in Eq. 1 with the hidden state \(h_{t}\) yields Eq. 2.
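A minimal sketch of such an LSTM-based predictor in PyTorch (module names, default sizes, and layout are our assumptions):

```python
import torch
import torch.nn as nn

class LSTMEventPredictor(nn.Module):
    """Sketch of the LSTM-based next-step event predictor (Eq. 2)."""

    def __init__(self, num_events, emb_size=256, hidden_size=1024):
        super().__init__()
        self.embed = nn.Linear(num_events, emb_size, bias=False)  # W^(emb)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_events)               # W^(fc), b^(fc)

    def forward(self, m):
        # m: (batch, time, num_events) multi-hot event vectors
        x = self.embed(m)                  # x_t = W^(emb) . m_t
        h, _ = self.lstm(x)                # h_t = LSTM(x_t, h_{t-1})
        return torch.sigmoid(self.fc(h))   # y_hat_{t+1} = sigma(W^(fc) . h_t + b^(fc))
```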

Recent Context-Aware LSTM-Based Event Predictor. When properly trained, the hidden states of an LSTM can be sufficient to represent and model the future behavior of an event time series by abstracting the dependencies between past and future events. However, to be trained properly, an LSTM (like any deep learning model) requires a large number of training instances. In the clinical domain, obtaining large numbers of examples of certain clinical events (e.g., rarely ordered medications or lab tests) is hard in general, which may prevent us from training an LSTM to predict such rare events. Meanwhile, for certain clinical event categories, such as medications, the future occurrence of an event may strongly depend on its recent or current occurrence, and incorporating this information may help compensate for the data deficiency.

Therefore, to address this problem, we propose and develop an adaptive mechanism that uses both the abstracted information about the past sequence stored in the LSTM hidden state and concrete information about event occurrences in a very recent context window. In contrast to the LSTM-based output generation in Eq. 2, which depends only on the abstracted hidden state, we refer directly to recent event occurrence information. The events at the current time step t, encoded in the multi-hot vector \(m_t\), are incorporated into the model through a linear transformation:

$$ b^{(u)} = W^{(s)} \cdot m_{t} + b^{(s)} $$

\(b^{(u)}\) can be seen as an additional, input-dependent bias term that reflects recent event occurrence information, and the final prediction of event occurrences is made as follows:

$$\begin{aligned} \hat{y}_{t+1} = \sigma (W^{(fc)} \cdot h_{t} + b^{(fc)} + b^{(u)}) \end{aligned}$$

The proposed predictor can also be seen as combining the LSTM-based predictor with the state-space Markov predictor. In particular, in the context of Markov state models, the next state in our model, and the transition to it, is defined by a combination of the recent state (most recent events) and the hidden state summarizing more distant past events.
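A minimal sketch of this combined predictor, extending the LSTM sketch above (again, the names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ContextAwareLSTMPredictor(nn.Module):
    """Sketch of the recent context-aware LSTM-based event predictor."""

    def __init__(self, num_events, emb_size=256, hidden_size=1024):
        super().__init__()
        self.embed = nn.Linear(num_events, emb_size, bias=False)  # W^(emb)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_events)               # W^(fc), b^(fc)
        self.recent = nn.Linear(num_events, num_events)            # W^(s), b^(s)

    def forward(self, m):
        # m: (batch, time, num_events) multi-hot event vectors
        h, _ = self.lstm(self.embed(m))
        b_u = self.recent(m)                     # b^(u) = W^(s) . m_t + b^(s)
        return torch.sigmoid(self.fc(h) + b_u)   # sigma(W^(fc) . h_t + b^(fc) + b^(u))
```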

Loss Function. To measure the performance of event prediction, the loss \(\mathcal {L}\) is defined as the binary cross-entropy between the label vector \(y_{t}\) and the prediction vector \(\hat{y}_{t}\), summed over all time steps of all sequences in the training set, where \(\mathbf 1\) denotes a vector filled with 1s:

$$\begin{aligned} \mathcal {L} = \sum _{t} - [y_{t} \cdot \log \hat{y}_{t} + (\mathbf 1 - y_{t}) \cdot \log (\mathbf 1 - \hat{y}_{t})] \end{aligned}$$

Parameter Learning. The parameters of the model are learned by back-propagation through time (BPTT) [32] with an adaptive stochastic gradient descent based optimizer [18]. Hyper-parameters are tuned by F1-score performance on the validation set over the following ranges: embedding (\(W^{(emb)}\)) size in \(\{128, 256, 512\}\) and hidden state size in \(\{512, 1024, 2048\}\), with learning rate 0.005 and batch size 512. To prevent over-fitting, early stopping and dropout (\(p=0.5\)) are applied.
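A condensed sketch of one training update under these settings; the choice of Adam as the adaptive optimizer and all variable names are our assumptions (the paper's reference [18] specifies the optimizer actually used):

```python
import torch
import torch.nn as nn

# Hypothetical setup; ContextAwareLSTMPredictor is the sketch shown earlier.
model = ContextAwareLSTMPredictor(num_events=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
criterion = nn.BCELoss(reduction='sum')  # binary cross-entropy summed over all steps

def train_step(m_batch, y_next_batch):
    """One BPTT update. m_batch holds input multi-hot vectors and y_next_batch
    the next-step event labels (as floats), both (batch, time, num_events)."""
    optimizer.zero_grad()
    y_hat = model(m_batch)
    loss = criterion(y_hat, y_next_batch)
    loss.backward()    # back-propagation through time over the whole sequence
    optimizer.step()
    return loss.item()
```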

4 Experimental Evaluation

4.1 Clinical Data

We test the proposed model on MIMIC-III, a clinical database generated from real-world EHRs of intensive care unit patients [16]. We extract 21,897 patients whose records were generated by the MetaVision system, one of the systems used to create records in the MIMIC-III database. We keep patients aged between 18 and 99 whose length of stay in the ICU is between 3 and 20 days. We randomly split the patients into training, test, and validation sets with a 7:2:1 ratio and generate multivariate event time series by segmenting each sequence into input windows and future windows of size \(W={24}\) hours, where each input window is followed immediately by its future window.
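As an illustration of this segmentation, a simplified sketch for a single patient stay (the data layout and function names are our assumptions):

```python
import numpy as np

def segment_windows(event_times, event_ids, num_events, stay_hours, window_hours=24):
    """Split one patient's stay into consecutive windows, each encoded as a
    multi-hot vector, and pair every input window with the window that follows it."""
    edges = np.arange(0, stay_hours + window_hours, window_hours)
    windows = []
    for start, end in zip(edges[:-1], edges[1:]):
        vec = np.zeros(num_events)
        in_window = (event_times >= start) & (event_times < end)
        vec[np.unique(event_ids[in_window])] = 1.0
        windows.append(vec)
    # The input at step t predicts the events of window t+1 (its future window).
    inputs, targets = np.array(windows[:-1]), np.array(windows[1:])
    return inputs, targets
```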

We consider the following types of events in our models: medication administration events, lab result events, procedure events, and physiological result events. Medication administration events record a specific kind of medication being administered to the patient. Lab result events record a lab test and its result, represented as normal, abnormal-high, or abnormal-low. Procedure events record procedures the patient received during hospitalization. For the medication, lab, and procedure event categories, we select those events observed in more than 100 different patients. Physiological result events consist of 23 cardiovascular, routine vital sign, respiratory, and hemodynamic signals selected by a critical care expert. Similarly to the lab result events, numeric physiological results are discretized into normal, abnormal-high, and abnormal-low. Table 1 shows the basic data statistics.

Table 1. Clinical data statistics by event categories

4.2 Evaluation Metrics

We evaluate the quality of the time series predictions using the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC). Although AUROC is commonly used to report results for binary classification problems, it can be misleading on highly imbalanced datasets, whereas AUPRC gives a more accurate profile of model performance under such circumstances [5, 27]. As shown in Table 1, our dataset is heavily skewed toward negative examples; therefore, we use AUPRC as our primary evaluation metric.
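A small sketch of how the two metrics can be computed with scikit-learn (the micro-averaged flattening across events is our choice):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true, y_score):
    """y_true, y_score: (num_windows, num_events) arrays of binary labels and
    predicted probabilities; returns micro-averaged AUROC and AUPRC."""
    labels, scores = y_true.ravel(), y_score.ravel()
    auroc = roc_auc_score(labels, scores)
    # average_precision_score is a standard step-wise estimate of AUPRC.
    auprc = average_precision_score(labels, scores)
    return auroc, auprc
```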

4.3 Baseline Models

We compare our proposed model to dense logistic regression models defined on the following inputs (predictors); a small feature-construction sketch for the history-based baselines follows the list:

Current Markov state (Markov) as defined in Eq. 1.

Binary History (LR-binary): Unlike the current Markov state information, this model considers the occurrence of all past events (not just the most recent one) and encodes them into one multi-hot vector.

Count History (LR-count): This model, similarly to Binary history, summarizes all past events (not just the most recent ones), but instead of multi-hot vector representation it uses a vector of event counts.

Current LSTM state (LSTM): The model uses the hidden state of the LSTM to summarize information from distant past important for prediction.
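For illustration, the binary-history and count-history inputs for the logistic regression baselines could be built as follows (a sketch under our own naming assumptions):

```python
import numpy as np

def history_features(windows):
    """windows: (T, num_events) multi-hot event vectors for one sequence.
    For each step t, returns features summarizing all windows up to and including t:
    whether each event has ever occurred (LR-binary) and cumulative counts (LR-count)."""
    counts = np.cumsum(windows, axis=0)     # LR-count features
    binary = (counts > 0).astype(float)     # LR-binary features
    return binary, counts
```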

4.4 Results

All our evaluations were performed on the test set, which was not touched during the training and validation steps. The prediction results in Fig. 1 summarize the performance of our model and the baselines on the 24-h prediction window. The results show that our model outperforms all baselines in terms of both AUROC and AUPRC. Moreover, the Markov state model is better than the pure LSTM in terms of AUPRC. This shows that the information from the most recent time window is, most of the time, the most important source for predicting the next-step events. This is not surprising given that many events (such as drug administrations or lab orders) are repeated every 24 h; hence, once observed, they are likely to occur again in the next time window.

Fig. 1. Overall time-series prediction results on the 24-h window segmentation

To verify the above reasoning, and to provide further insight into the predictive performance of our models, we break the above results down by separately considering predictions of events that also occurred in the previous time step and of events that did not. We refer to these as repetitive and non-repetitive patterns, respectively. The results are given in Fig. 2. They clearly show that predicting non-repetitive events is significantly more difficult than predicting repetitive ones. Despite this, our model consistently outperforms the other baselines in both the repetitive and non-repetitive scenarios. Remarkably, for non-repetitive event prediction, our model exceeds the average of all baseline models by 32% in AUPRC and by 11% in AUROC.
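One possible way to form this repetitive/non-repetitive split (our sketch, not the paper's code):

```python
import numpy as np

def repetition_masks(inputs, targets):
    """inputs, targets: (T, num_events) multi-hot vectors, where targets[t] holds
    the events of the window following inputs[t]. A target event is 'repetitive'
    if the same event also occurred in the corresponding input window."""
    repetitive = (targets == 1) & (inputs == 1)
    non_repetitive = (targets == 1) & (inputs == 0)
    return repetitive, non_repetitive
```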

Fig. 2. Prediction results on repetitive and non-repetitive events

To analyze our results further, we next break the evaluation down by inspecting the predictive performance of the models for the different event categories. The results are shown in Fig. 3. Our model consistently outperforms the baseline models across all event categories in both AUROC and AUPRC.

So far, all our results were obtained with a window size of 24 h. Next, we investigate the predictive performance of the models under varying window sizes; specifically, we consider window sizes W of 6, 12, and 24 h. Due to space limits, we compare the methods using only the AUPRC statistic. As shown in Fig. 4, our model shows superior performance across all time resolutions.

Fig. 3. Prediction results by the event type category

Fig. 4. AUPRC prediction statistics for the different window sizes

To dig deeper into the time segmentation results, Fig. 5 shows the predictive performance on lab test and physiological result events. On lab test event prediction, our model dominates at the larger window sizes (\(W=12, 24\)), outperforming the baseline models by 27%. At the smaller window size (\(W=6\)), the LSTM performs slightly better than our model, by 2%. On physiological event prediction, our model surpasses all baselines across all time resolutions.

Interestingly, on lab event prediction, overall predictability is highest at \(W=24\) and deteriorates for smaller window sizes. This reflects the 24-hour recurrence cycle of lab events: lab tests and their results are, most of the time, ordered and observed once daily. Conversely, the overall predictability of physiological events decreases with increasing window length, indicating a recurrent characteristic with a much shorter interval. Most physiological result events are generated automatically by bedside monitoring devices at short intervals, typically on the scale of seconds to minutes. Therefore, the variability of observations in time series generated with smaller windows should be lower than with larger windows, and overall predictability at the smaller time resolutions is consistently higher, as seen in Fig. 5.

Fig. 5. Prediction results for lab and physiological events for the different window sizes

5 Conclusion

In this work, we show the importance of two sources of information for event time series modeling. One source is derived from the set of recently observed clinical events; the other is based on the hidden state of the LSTM, which aims to abstract more distant past patient information that is predictive of future events. We show that the combination of the two sources of information implemented in our method leads to improved prediction performance on MIMIC-III clinical event data when compared to models that rely on only one of the sources.