1 Introduction

Health care costs are escalating. To deliver cost-effective, quality care, modern health systems are turning to data to predict risk and adverse events. For example, identifying patients at high risk of readmission can help hospitals tailor suitable care packages.

Modern electronic medical records (EMRs) offer the base on which to build prognostic systems [11, 15, 19]. Such inquiry necessitates modeling patient-level temporal healthcare processes, which is challenging. The records are a mixture of the illness trajectory and the interventions and complications. Thus medical records vary in length, and are inherently episodic and irregular over time. There are long-term dependencies in the data: future illness and care may depend critically on past illness and interventions. Existing methods either ignore long-term dependencies or do not adequately handle variable length [1, 15, 19]. Neither are they able to model temporal irregularity [14, 20, 22].

Addressing these open problems, we introduce DeepCare, a deep, dynamic neural network that reads medical records, infers illness states and predicts future outcomes. DeepCare has several layers. At the bottom, we start by modeling illness-state trajectories and healthcare processes [2, 7] based on Long Short-Term Memory (LSTM) [5, 9]. LSTM is a recurrent neural network equipped with memory cells that store previous experiences. The current medical risk states are modeled as a combination of illness memory and the current medical conditions, moderated by past and current interventions. The illness memory is partly forgotten or consolidated through a mechanism known as the forget gate. LSTM can handle variable lengths with long dependencies, making it an ideal model for diverse sequential domains [6, 17, 18]. Interestingly, LSTM has never been used in healthcare, possibly because of the difficulty of handling irregular time and interventions.

We augment LSTM with several new mechanisms to handle the forgetting and consolidation of illness through the memory. First, the forgetting and consolidation mechanisms are time-moderated. Second, interventions are modeled as a moderating factor of the current risk states and of the memory carried into the future. The resulting model is sparse and efficient: only observed records are incorporated, regardless of the irregular time spacing. At the second layer of DeepCare, episodic risk states are aggregated through a new time-decayed multiscale pooling strategy. This allows further handling of time-modulated memory. Finally, at the top layer, pooled risk states are passed through a neural network to estimate future prognosis. In short, the computation steps in DeepCare can be summarized as:

$$\begin{aligned} P\left( y\mid {{\varvec{x}}}_{1:n}\right) =P\left( \text {nnet}_{y}\left( \text {pool}\left\{ \text {LSTM}({{\varvec{x}}}_{1:n})\right\} \right) \right) \end{aligned}$$
(1)

where \({{\varvec{x}}}_{1:n}\) is the input sequence of admission observations, y is the outcome of interest (e.g., readmission), \(\text{ nnet }_{y}\) denotes the estimate of the neural network with respect to outcome y, and P is a probabilistic model of outcomes.

We demonstrate DeepCare in answering a crucial component of the holy grail question “what happens next?”. In particular, we predict the next stage of disease progression and the risk of unplanned readmission for diabetic patients after discharge from hospital. Our cohort consists of more than 12,000 patients whose data were collected from a large regional hospital over the period 2002 to 2013. Forecasting future events may be considerably harder than the classical classification of objects into categories due to the inherent uncertainty in unseen interleaved events. We show that DeepCare is well-suited for modeling disease progression, as well as predicting future risk.

To summarize, our main contributions are: (i) introducing DeepCare, a deep dynamic neural network for medical prognosis. DeepCare models irregular timing and interventions within LSTM – a powerful recurrent neural network for sequences; and (ii) demonstrating the effectiveness of DeepCare for disease progression modeling and medical risk prediction, and showing that it outperforms baselines.

2 Long Short-Term Memory

This section briefly reviews Long Short-Term Memory (LSTM), a recurrent neural network (RNN) for sequences. An LSTM is a sequence of units that share the same set of parameters. Each LSTM unit has a memory cell with state \({{\varvec{c}}}_{t}\in \mathbb {R}^{K}\) at time t. The memory is updated by reading a new input \({{\varvec{x}}}_{t}\in \mathbb {R}^{M}\) and the previous output \({{\varvec{h}}}_{t-1}\in \mathbb {R}^{K}\). Then an output state \({{\varvec{h}}}_{t}\) is written based on the memory \({{\varvec{c}}}_{t}\). Three sigmoid gates control the reading, writing and memory updating: the input gate \({{\varvec{i}}}_{t}\), the output gate \({{\varvec{o}}}_{t}\) and the forget gate \({{\varvec{f}}}_{t}\), respectively. The gates and states are computed as follows:

$$\begin{aligned} {{\varvec{i}}}_{t}= & {} \sigma \left( W_{i}{{\varvec{x}}}_{t}+U_{i}{{\varvec{h}}}_{t-1}+{{\varvec{b}}}_{i}\right) \end{aligned}$$
(2)
$$\begin{aligned} {{\varvec{f}}}_{t}= & {} \sigma \left( W_{f}{{\varvec{x}}}_{t}+U_{f}{{\varvec{h}}}_{t-1}+{{\varvec{b}}}_{f}\right) \end{aligned}$$
(3)
$$\begin{aligned} {{\varvec{o}}}_{t}= & {} \sigma \left( W_{o}{{\varvec{x}}}_{t}+U_{o}{{\varvec{h}}}_{t-1}+{{\varvec{b}}}_{o}\right) \end{aligned}$$
(4)
$$\begin{aligned} {{\varvec{c}}}_{t}= & {} {{\varvec{f}}}_{t}*{{\varvec{c}}}_{t-1}+{{\varvec{i}}}_{t}*\text{ tanh }\left( W_{c}{{\varvec{x}}}_{t}+U_{c}{{\varvec{h}}}_{t-1}+{{\varvec{b}}}_{c}\right) \end{aligned}$$
(5)
$$\begin{aligned} {{\varvec{h}}}_{t}= & {} {{\varvec{o}}}_{t}*\text{ tanh }({{\varvec{c}}}_{t}) \end{aligned}$$
(6)

where \(\sigma \) denotes the sigmoid function, \(*\) denotes the element-wise product, and \(W_{i,f,o,c}\), \(U_{i,f,o,c}\), \({{\varvec{b}}}_{i,f,o,c}\) are parameters. The gates take values in (0, 1).
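For concreteness, the following is a minimal NumPy sketch of one LSTM step following Eqs. (2)–(6); the dictionary layout, the initialization and the dimensions are illustrative assumptions, not the parameterization used in our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update following Eqs. (2)-(6).
    x_t: input (M,); h_prev, c_prev: previous output and memory (K,)."""
    W, U, b = params["W"], params["U"], params["b"]          # one matrix/bias per gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate,  Eq. (2)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate, Eq. (3)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate, Eq. (4)
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # Eq. (5)
    h_t = o_t * np.tanh(c_t)                                  # Eq. (6)
    return h_t, c_t

# Illustrative shapes only (M = 30 inputs, K = 40 memory cells):
M, K = 30, 40
rng = np.random.default_rng(0)
params = {
    "W": {g: 0.01 * rng.standard_normal((K, M)) for g in "ifoc"},
    "U": {g: 0.01 * rng.standard_normal((K, K)) for g in "ifoc"},
    "b": {g: np.zeros(K) for g in "ifoc"},
}
h_t, c_t = lstm_step(rng.standard_normal(M), np.zeros(K), np.zeros(K), params)
```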

The memory cell plays a crucial role in memorizing past experiences. The key is the additive memory update in Eq. (5): if \({{\varvec{f}}}_{t}\rightarrow \mathbf {1}\) then all the past memory is preserved. Thus memory can potentially grow over time since new experience is still added through the gate \({{\varvec{i}}}_{t}\). If \({{\varvec{f}}}_{t}\rightarrow \mathbf {0}\) then only new experience is retained (memoryless). An important property of additivity is that it helps avoid a classic problem in standard recurrent neural networks known as vanishing/exploding gradients when t is large (say, greater than 10).

LSTM for Sequence Labeling. The output states \({{\varvec{h}}}_{t}\) can be used to generate labels at time t as follows:

$$\begin{aligned} P\left( y_{t}=l\mid {{\varvec{x}}}_{1:t}\right) =\text{ softmax }\left( {{\varvec{v}}}_{l}^{\top }{{\varvec{h}}}_{t}\right) \end{aligned}$$
(7)

for label specific parameters \({{\varvec{v}}}_{l}\).

LSTM for Sequence Classification. LSTM can be used for classification using a simple mean-pooling strategy over all output states coupled with a differentiable loss function. For example, in the case of binary outcome \(y\in \{0,1\}\), we have:

$$\begin{aligned} P\left( y=1\mid {{\varvec{x}}}_{1:n}\right) =\text{ LR }\left( \text{ pool }\left\{ \text{ LSTM }({{\varvec{x}}}_{1:n})\right\} \right) \end{aligned}$$
(8)

where \(\text{ LR }\) denotes probability estimate of the logistic regression, and \(\text {pool}\left\{ {{\varvec{h}}}_{1:n}\right\} =\frac{1}{n}\sum _{t=1}^{n}{{\varvec{h}}}_{t}\).
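A corresponding sketch of Eq. (8), reusing the hypothetical `lstm_step` above; the logistic-regression weights `w_lr`, `b_lr` are assumed to be learned jointly with the LSTM.

```python
def lstm_classify(x_seq, params, w_lr, b_lr):
    """Binary sequence classification via mean-pooling, Eq. (8)."""
    K = w_lr.shape[0]
    h, c = np.zeros(K), np.zeros(K)
    states = []
    for x_t in x_seq:                        # run the LSTM over the admission sequence
        h, c = lstm_step(x_t, h, c, params)
        states.append(h)
    pooled = np.mean(states, axis=0)         # pool{h_1:n}
    return sigmoid(w_lr @ pooled + b_lr)     # P(y = 1 | x_1:n) via logistic regression
```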

Fig. 1. DeepCare architecture. The bottom layer is Long Short-Term Memory [9] with irregular timing and interventions (see also Fig. 2b).

3 DeepCare: A Deep Dynamic Memory Model

In this section we present our contribution, named DeepCare, for modeling illness trajectories and predicting future outcomes. As illustrated in Fig. 1, DeepCare is a deep dynamic neural network with three main layers. The bottom layer is built on LSTM whose memory cells are modified to handle irregular timing and interventions. More specifically, the input is a sequence of admissions. Each admission t contains a set of diagnosis codes (formulated as a feature vector \({{\varvec{x}}}_{t}\in \mathbb {R}^{M}\)), a set of intervention codes (formulated as a feature vector \({{\varvec{p}}}_{t}\)), the admission method \(m_{t}\) and the elapsed time \(\varDelta t\in \mathbb {R}^{+}\) between the two admissions t and \(t-1\). Denoting the input sequence by \({{\varvec{u}}}_{0},{{\varvec{u}}}_{1},\ldots ,{{\varvec{u}}}_{n}\), where \({{\varvec{u}}}_{t}=[{{\varvec{x}}}_{t},{{\varvec{p}}}_{t},m_{t},\varDelta t]\), the LSTM computes the corresponding sequence of distributed illness states \({{\varvec{h}}}_{0},{{\varvec{h}}}_{1},\ldots ,{{\varvec{h}}}_{n}\), where \({{\varvec{h}}}_{t}\in \mathbb {R}^{K}\). The middle layer aggregates illness states through multiscale weighted pooling \({{\varvec{z}}}=\text{ pool }\left\{ {{\varvec{h}}}_{0},{{\varvec{h}}}_{1},\ldots ,{{\varvec{h}}}_{n}\right\} \), where \({{\varvec{z}}}\in \mathbb {R}^{K\times s}\) for s scales.

The top layer is a neural network that takes pooled states and other statistics to estimate the final outcome probability, as summarized in Eq. (1) as \(P\left( y\mid {{\varvec{x}}}_{1:n}\right) =P\left( \text{ nnet }_{y}\left( \text{ pool }\left\{ \text{ LSTM }({{\varvec{x}}}_{1:n})\right\} \right) \right) \). The probability \(P\left( y\mid {{\varvec{x}}}_{1:n}\right) \) depends on the nature of the outputs and the choice of statistical structure. For example, for binary outcome, \(P\left( y=1\mid {{\varvec{x}}}_{1:n}\right) \) is a logistic function; for multiclass outcome, \(P\left( y\mid {{\varvec{x}}}_{1:n}\right) \) is a softmax function; and for continuous outcome, \(P\left( y\mid {{\varvec{x}}}_{1:n}\right) \) is a Gaussian. In what follows, we describe the first two layers in more detail.
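As an illustration of the top layer, here is a sketch in which a single tanh hidden layer stands in for \(\text{ nnet }_{y}\); the hidden size and parameter names are our assumptions, not a prescribed architecture.

```python
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def prognosis_head(z, V, b_V, u_out, b_out, outcome="binary"):
    """Top layer: pooled states z -> hidden layer -> outcome probability."""
    hidden = np.tanh(V @ z + b_V)             # nnet_y(.), one hidden layer assumed
    score = u_out @ hidden + b_out
    if outcome == "binary":
        return sigmoid(score)                 # logistic for y in {0, 1}
    return softmax(score)                     # softmax for multiclass outcomes
```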

Fig. 2. (a) Admission embedding. Discrete diagnoses and interventions are embedded into two vectors \({{\varvec{x}}}_{t}\) and \({{\varvec{p}}}_{t}\). (b) Modified LSTM unit as a carrier of illness history. Compared to the original LSTM unit (Sect. 2), the modified unit models time, admission methods, diagnoses and interventions.

3.1 Admission Embedding

Figure 2a illustrates the admission embedding. There are two main types of information recorded in a typical EMR: (i) diagnoses of the current condition; and (ii) interventions. Diagnoses are represented using WHO’s ICD (International Classification of Diseases) coding schemes. Interventions include procedures and medications. Procedures are typically coded in the CPT (Current Procedural Terminology) or ICHI (International Classification of Health Interventions) schemes. Medication names can be mapped into the ATC (Anatomical Therapeutic Chemical) scheme. These schemes are hierarchical and their vocabularies contain tens of thousands of codes. Thus, for a given problem, a suitable coding level should be chosen to balance specificity and robustness.

Codes are first embedded into a vector space of size M, and the embedding is learnable. Since each admission typically consists of multiple diagnoses, we average all the present vectors to derive \({{\varvec{x}}}_{t}\in \mathbb {R}^{M}\). Likewise, we derive the averaged intervention vector \({{\varvec{p}}}_{t}\in \mathbb {R}^{M}\). Finally, an admission embedding is a 2M-dim vector \(\left[ {{\varvec{x}}}_{t},{{\varvec{p}}}_{t}\right] \).
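A minimal sketch of the embedding step; here the lookup tables are plain dictionaries of M-dim vectors, whereas in DeepCare they are learned jointly with the rest of the network.

```python
def embed_admission(diag_codes, interv_codes, E_diag, E_interv):
    """Average the embeddings of all codes present in an admission (Sect. 3.1)."""
    x_t = np.mean([E_diag[c] for c in diag_codes], axis=0)      # diagnosis vector x_t
    p_t = np.mean([E_interv[c] for c in interv_codes], axis=0)  # intervention vector p_t
    return x_t, p_t                                  # [x_t, p_t] is the 2M-dim embedding

# Toy example (M = 4); real tables cover hundreds of codes and are learned:
E_diag = {"E11": np.ones(4), "I48": np.zeros(4)}
E_interv = {"A10": 0.5 * np.ones(4)}
x_t, p_t = embed_admission(["E11", "I48"], ["A10"], E_diag, E_interv)
```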

3.2 Moderating Admission Method and Effect of Interventions

There are two main types of admission: planned and unplanned. Unplanned admission refers to transfer from emergency attendance, which typically indicates higher risk. Recall from Eqs. (2, 5) that the input gate \({{\varvec{i}}}\) controls how much new information is updated into memory \({{\varvec{c}}}\). The gate can be modified to reflect the risk level of the admission type as follows:

$$\begin{aligned} {{\varvec{i}}}_{t}=\frac{1}{m_{t}}\sigma \left( W_{i}{{\varvec{x}}}_{t}+U_{i}{{\varvec{h}}}_{t-1}+{{\varvec{b}}}_{i}\right) \end{aligned}$$
(9)

where \(m_{t}=1\) for emergency admissions and \(m_{t}=2\) for routine admissions.

Since interventions are designed to cure diseases or reduce patient’s illness, the output gate is moderated by the current intervention as follows:

$$\begin{aligned} {{\varvec{o}}}_{t}=\sigma \left( W_{o}{{\varvec{x}}}_{t}+U_{o}{{\varvec{h}}}_{t-1}+P_{o}{{\varvec{p}}}_{t}+{{\varvec{b}}}_{o}\right) \end{aligned}$$
(10)

Interventions may have longer-term impacts than just reducing the current illness. This suggests that illness forgetting is moderated by the previous intervention:

$$\begin{aligned} {{\varvec{f}}}_{t}=\sigma \left( W_{f}{{\varvec{x}}}_{t}+U_{f}{{\varvec{h}}}_{t-1}+P_{f}{{\varvec{p}}}_{t-1}+{{\varvec{b}}}_{f}\right) \end{aligned}$$
(11)

where \({{\varvec{p}}}_{t-1}\) is the intervention at time step \(t-1\).

3.3 Capturing Time Irregularity

We introduce two mechanisms of forgetting the memory by modifying the forget gate \({{\varvec{f}}}_{t}\) in Eq. (11):

Time Decay. Recall that the memory cell holds the current illness states, and the illness memory can be carried on into the future. There are acute conditions whose effects naturally diminish over time. This suggests a simple decay:

$$\begin{aligned} {{\varvec{f}}}_{t}\leftarrow d(\varDelta _{t-1:t}){{\varvec{f}}}_{t} \end{aligned}$$
(12)

where \(\varDelta _{t-1:t}\) is the time elapsed between step \(t-1\) and step t, and \(d\left( \varDelta _{t-1:t}\right) \in (0,1]\) is a decay function, i.e., it is monotonically non-increasing in time. One function we found to work well is \(d(\varDelta _{t-1:t})=\left[ \log (e+\varDelta _{t-1:t})\right] ^{-1}\), where \(e\approx 2.718\) is the base of the natural logarithm.
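A one-line sketch of this decay; the unit of \(\varDelta \) (days, in this example) is an assumption here and should match the time scale at which the records are indexed.

```python
def time_decay(delta):
    """d(Delta) = 1 / log(e + Delta), Eq. (12): equals 1 at Delta = 0, decreases slowly."""
    return 1.0 / np.log(np.e + delta)

# time_decay(0.0) == 1.0; time_decay(365.0) is about 0.17 for a one-year gap in days.
```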

Forgetting Through Parametric Time. Time decay may not capture all conditions, since some conditions can get worse, and others can be chronic. This suggests a more flexible parametric forgetting:

$$\begin{aligned} {{\varvec{f}}}_{t}=\sigma \left( W_{f}{{\varvec{x}}}_{t}+U_{f}{{\varvec{h}}}_{t-1}+Q_{f}{{\varvec{q}}}_{\varDelta _{t-1:t}}+P_{f}{{\varvec{p}}}_{t-1}+{{\varvec{b}}}_{f}\right) \end{aligned}$$
(13)

where \({{\varvec{q}}}_{\varDelta _{t-1:t}}\) is a vector derived from the time difference \(\varDelta _{t-1:t}\). For example, we may have \({{\varvec{q}}}_{\varDelta _{t-1:t}}=\left( \varDelta _{t-1:t},\varDelta _{t-1:t}^{2},\varDelta _{t-1:t}^{3}\right) \) to model third-degree forgetting dynamics.
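Putting Sects. 3.2 and 3.3 together, the following sketch shows one step of the modified unit (cf. Fig. 2b); the parameter layout, the cubic time feature and the rescaling of \(\varDelta \) are illustrative assumptions.

```python
def deepcare_step(x_t, p_t, p_prev, m_t, delta, h_prev, c_prev, params):
    """Modified LSTM step combining Eqs. (9)-(11) and (13).
    delta: elapsed time since the previous admission (rescaled, e.g. to years,
    so that the polynomial time features stay well-behaved)."""
    W, U, P, b, Q_f = params["W"], params["U"], params["P"], params["b"], params["Q_f"]
    q = np.array([delta, delta**2, delta**3])                              # q_{Delta}
    i_t = (1.0 / m_t) * sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # Eq. (9)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev                           # Eq. (13)
                  + Q_f @ q + P["f"] @ p_prev + b["f"])
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + P["o"] @ p_t + b["o"])  # Eq. (10)
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    h_t = o_t * np.tanh(c_t)                                               # illness state
    return h_t, c_t
```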

3.4 Recency Attention via Multiscale Pooling

Once the illness dynamics have been modeled using the memory LSTM, the next step is to aggregate the illness states to infer future prognosis. The simplest way is mean-pooling, where \(\bar{{{\varvec{h}}}}=\text{ pool }\left\{ {{\varvec{h}}}_{1:n}\right\} =\frac{1}{n}\sum _{t=1}^{n}{{\varvec{h}}}_{t}\). However, this does not reflect the attention to recency in healthcare. Here we introduce a simple attention scheme that weighs recent events more than old ones: \(\bar{{{\varvec{h}}}}=\left( \sum _{t=t_{0}}^{n}w_{t}{{\varvec{h}}}_{t}\right) /\sum _{t=t_{0}}^{n}w_{t},\) where

$$\begin{aligned} w_{t}= & {} \left[ m_{t}+\text{ log }\left( 1+\varDelta _{t:n}\right) \right] ^{-1} \end{aligned}$$

and \(\varDelta _{t:n}\) is the elapsed time between step t and the current step n, measured in months; \(m_{t}=1\) for emergency admissions and \(m_{t}=2\) for routine admissions. The starting time step \(t_{0}\) controls the length of the look-back in the pooling, for example, \(\varDelta _{t_{0}:n}\le 12\) for a one-year look-back. Since diseases progress at different rates for different patients, we employ multiple look-backs: 12 months, 24 months, and all available history. Finally, the three pooled illness states are stacked into a vector \(\left[ \bar{{{\varvec{h}}}}_{12},\bar{{{\varvec{h}}}}_{24},\bar{{{\varvec{h}}}}_{all}\right] \), which is then fed to a neural network to infer future prognosis.
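A sketch of the pooling step, assuming `h_seq` holds the illness states, `m_seq` the admission methods, and `months_back[t]` the elapsed time \(\varDelta _{t:n}\) in months; the function names are ours.

```python
def recency_pool(h_seq, m_seq, months_back, lookback=None):
    """Recency-weighted pooling with w_t = 1 / (m_t + log(1 + Delta_{t:n})) (Sect. 3.4)."""
    idx = [t for t in range(len(h_seq))
           if lookback is None or months_back[t] <= lookback]
    w = np.array([1.0 / (m_seq[t] + np.log(1.0 + months_back[t])) for t in idx])
    H = np.stack([h_seq[t] for t in idx])
    return (w[:, None] * H).sum(axis=0) / w.sum()

def multiscale_pool(h_seq, m_seq, months_back):
    """Stack the 12-month, 24-month and all-history pooled illness states."""
    return np.concatenate([recency_pool(h_seq, m_seq, months_back, lb)
                           for lb in (12, 24, None)])
```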

3.5 Learning

Learning is carried out through minimizing cross-entropy: \(L=-\log P\left( y\mid {{\varvec{x}}}_{1:n}\right) \), where \(P\left( y\mid {{\varvec{x}}}_{1:n}\right) \) is given in Eq. (1). For example, in the case of binary classification, \(y\in \{0,1\}\), we use logistic regression to represent \(P\left( y\mid {{\varvec{x}}}_{1:n}\right) \), i.e.,

$$\begin{aligned} P\left( y=1\mid {{\varvec{x}}}_{1:n}\right) =\sigma \left( b_{y}+\text {nnet}\left( \text {pool}\left\{ \text {LSTM}({{\varvec{x}}}_{1:n})\right\} \right) \right) \end{aligned}$$

where the structure inside the sigmoid is given in Eq. (1). The cross-entropy becomes \(L=-y\log \sigma -(1-y)\log (1-\sigma )\). Despite its complex structure, DeepCare’s loss function is fully differentiable and can thus be minimized using standard back-propagation. The details are omitted due to space constraints.

Fig. 3. Top row: data statistics (y-axis: number of patients; x-axis: (a) age, (b) number of admissions, (c) number of days). Bottom row: progression from pre-diabetes (upper diag. cloud) to post-diabetes (lower diag. cloud).

4 Experiments

4.1 Data

The dataset is a diabetes cohort of more than 12,000 patients (55.5 % male, median age 73) collected over the 12-year period 2002–2013 from a large regional Australian hospital. Data statistics are summarized in Fig. 3. Diagnoses are coded using the ICD-10 scheme; for example, E10 is Type I diabetes and E11 is Type II diabetes. Procedures are coded using the ACHI (Australian Classification of Health Interventions) scheme, and medications are mapped to ATC codes. We preprocessed the data by removing (i) admissions with missing key information; and (ii) patients with fewer than 2 admissions. This leaves 7,191 patients with 53,208 admissions. To reduce the vocabulary, we collapse diagnoses that share the first 2 characters into one diagnosis. Likewise, the first digits in the procedure block are used. In total, there are 243 diagnosis, 773 procedure and 353 medication codes.

4.2 Implementation

The training, validation and test sets are created by randomly assigning 2/3, 1/6 and 1/6 of the data points, respectively. We varied the embedding and hidden dimensions from 5 to 50, but the results are rather robust. We report results for \(M=30\) embedding dimensions and \(K=40\) hidden units. Learning is by SGD with mini-batches of 16. The learning rate starts at 0.01. If no smaller training cost has been found for \(n_{waiting}\) epochs since the epoch with the smallest training cost, the learning rate is halved. Initially \(n_{waiting}=5\), and it is updated as \(n_{waiting}=\text{ min }\left\{ 15,n_{waiting}+2\right\} \) at each halving. Learning terminates after \(n_{epoch}=200\) epochs or once the learning rate falls below \(\epsilon =0.0001\).
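A sketch of this schedule; `train_one_epoch` is a hypothetical function running one SGD epoch (mini-batches of 16) and returning the training cost, and the exact bookkeeping after a halving is our reading of the procedure.

```python
def run_training(train_one_epoch, lr=0.01, n_epoch=200, eps=1e-4):
    """SGD schedule of Sect. 4.2: halve the learning rate when no new best
    training cost has been seen for n_waiting epochs."""
    best_cost, since_best, n_waiting = np.inf, 0, 5
    for epoch in range(n_epoch):
        cost = train_one_epoch(lr)
        if cost < best_cost:
            best_cost, since_best = cost, 0
        else:
            since_best += 1
        if since_best >= n_waiting:
            lr /= 2.0                              # halve the learning rate
            n_waiting = min(15, n_waiting + 2)     # lengthen the waiting window
            since_best = 0
        if lr < eps:                               # terminate once lr < 0.0001
            break
    return best_cost
```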

4.3 Modeling Disease Progression

We first verify that the recurrent memory embedded in DeepCare is a realistic model of disease progression. We use the bottom layer of DeepCare (Sects. 3.1–3.3) to predict the next \(n_{pred}\) diagnoses at each discharge using Eq. (7).

Table 1 reports the Precision@\(n_{pred}\). The Markov model has memoryless disease transition probabilities \(P\left( d_{t+1}^{i}\mid d_{t}^{j}\right) \) from disease \(d^{j}\) at time t to disease \(d^{i}\) at time \(t+1\). Given an admission with disease subset \(D_{t}\), the next-disease probability is estimated as \(Q\left( d^{i};t\right) =\frac{1}{\left| D_{t}\right| }\sum _{j\in D_{t}}P\left( d_{t+1}^{i}\mid d_{t}^{j}\right) \). Using a plain RNN improves over the memoryless Markov model by \(8.8\,\%\) with \(n_{pred}=1\) and by \(27.7\,\%\) with \(n_{pred}=3\). Modeling irregular timing and interventions in DeepCare gains a further \(2\,\%\) improvement.

Table 1. Precision@\(n_{pred}\) diagnoses prediction.
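For reference, a sketch of the Markov baseline's scoring rule; `trans[j]` is assumed to be a dictionary of estimated transition probabilities \(P(d^{i}\mid d^{j})\) out of disease j.

```python
def markov_next_diseases(D_t, trans, n_pred=1):
    """Markov baseline: Q(d^i; t) = (1/|D_t|) * sum_{j in D_t} P(d^i | d^j),
    then return the n_pred highest-scoring candidate diagnoses."""
    scores = {}
    for j in D_t:
        for i, p in trans.get(j, {}).items():
            scores[i] = scores.get(i, 0.0) + p / len(D_t)
    return sorted(scores, key=scores.get, reverse=True)[:n_pred]
```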

4.4 Predicting Unplanned Readmission

Next we demonstrate DeepCare on risk prediction. For each patient, a discharge is randomly chosen as the prediction point, from which unplanned readmission within 12 months is predicted. Baselines are SVM and Random Forests running on standard non-temporal feature engineering using a one-hot representation of diagnosis and intervention codes. Pooling is then applied to aggregate over all existing admissions for each patient. Two pooling strategies are tested: max and sum. Max-pooling is equivalent to the presence-only strategy in [1], and sum-pooling is akin to a uniform convolutional kernel in [20]. This feature engineering strategy amounts to zero forgetting – any risk factor occurring in the past is memorized.

Fig. 4. (Left) 40 channels of forgetting due to time elapsed. (Right) The forget gates of a patient in the course of their illness.

Dynamics of Forgetting. Figure 4 (left) plots the contribution of time to the forget gate. The contributions for all 40 states are computed using \(Q_{f}{{\varvec{q}}}_{\varDelta _{t}}\) as in Eq. (13). There are two distinct patterns: decay and growth. This suggests that time-based forgetting has a small intrinsic dimensionality: using decay only, as in Eq. (12), may under-parameterize time, while the full parameterization of Eq. (13) may over-parameterize it. Finding the right balance warrants further investigation. Figure 4 (right) shows the evolution of the forget gates over the course of illness (2000 days) for one patient.

Prediction Results. Table 2 reports the F-scores. The best non-temporal baseline, Random Forests with sum-pooling, has an F-score of 71.4 % [Row 4]. Using LSTM with simple mean-pooling and logistic regression already improves over the best non-temporal method by 4.5 % for 12-month prediction [Row 5, ref: Sect. 2]. Moving to deep models by using a neural network as the classifier yields a 5.1 % improvement [Row 6, ref: Eq. (1)]. Carefully modeling the irregular timing, interventions and recency \(+\) multiscale pooling gains a 5.7 % improvement [Row 7, ref: Sects. 3.2, 3.3]. Finally, with parametric time we arrive at a 79.1 % F-score, a 7.7 % improvement over the best baseline [Row 8, ref: Sects. 3.2, 3.3].

Table 2. Results of unplanned readmission prediction within 12 months.

5 Related Work and Discussion

Electronic medical records (EMRs) are the result of the interleaving between illness processes and care processes. Using EMRs for prediction has attracted significant interest in recent years [11, 19]. However, most existing methods are either based on manual feature engineering [15], simplistic extraction [20], or assume regular timing, as in dynamic Bayesian networks [16]. Irregular timing and interventions have not been adequately modeled. The nursing illness trajectory model was popularized by Strauss and Corbin [2, 4], but the model is qualitative and imprecise in time [7]; thus its predictive power is very limited. Capturing disease progression has been of great interest [10, 14], and much effort has been spent on Markov models [8, 22]. However, healthcare is inherently non-Markovian due to long-term dependencies. For example, a routine admission with irrelevant medical information would destroy the illness memory [1], especially for chronic conditions.

Deep learning is currently at the center of a new revolution in making sense of large volumes of data. It has achieved great successes in cognitive domains such as vision and NLP [12]. To date, the deep learning approach to healthcare has been a largely unrealized promise, except for several very recent works [3, 13, 21], where irregular timing is not properly modeled. We observe a considerable similarity between NLP and EMRs, where diagnoses and interventions play the role of nouns and modifiers, and an EMR is akin to a sentence. A major difference is the presence of precise timing in EMRs, as well as their episodic nature. Our DeepCare contributes along that line.

DeepCare is generic and can be implemented on existing EMR systems. For that, more extensive evaluations on a variety of cohorts, sites and outcomes will be necessary. This offers opportunities for domain adaptation through parameter sharing among multiple cohorts and hospitals.

6 Conclusion

In this paper we have introduced DeepCare, a deep dynamic memory neural network for personalized healthcare. In particular, DeepCare supports prognosis from electronic medical records. DeepCare contributes to the healthcare modeling literature by introducing the concept of illness memory into the nursing model of illness trajectories. To achieve precision and predictive power, DeepCare extends the classic Long Short-Term Memory by (i) parameterizing time to enable irregular timing; (ii) incorporating interventions to reflect their targeted influence on the course of illness and disease progression; (iii) using multiscale pooling over time; and finally (iv) augmenting a neural network to infer future outcomes. We have demonstrated DeepCare on predicting next disease stages and unplanned readmission among diabetic patients. The results are competitive with the current state of the art. DeepCare opens up a new principled approach to predictive medicine.