
1 Introduction

The traditional model of clinical practice incorporates diagnosis, prognosis, and treatment. Diagnosis is fundamental to the practice of medicine, and mastering it is central to both becoming and practicing as a doctor. Yet diagnosis has only recently begun to receive the focused medical and computational science attention that many have argued it warrants [3]. This is beginning to change with the emergence of computer-aided diagnosis, which explores the activity and its outcomes as a prism through which many issues are played out [14]. It is argued that diagnosis serves many functions for patients, clinicians, and wider society [14], and can be understood both as a category and as a process [3]. Diagnosis classifies the sick patient as having or not having a particular disease. Historically, diagnosis was regarded as the primary guide to treatment and prognosis (“what is likely to happen in the future”), and it is still considered the core component of clinical practice [8].

Intensive care refers to the specialized treatment given to patients who are acutely unwell and require critical medical care. An Intensive Care Unit (ICU) provides critical care and life support for acutely ill and injured patients, and it is one of the most critical operational environments in a hospital. To treat ICU patients, clinicians need to act within a remarkably short period. Intensivists depend upon a large number of measurements to make daily decisions in the ICU, but the reliability of these measurements may be jeopardized by the effects of therapy [18]. Moreover, in critical illness, what is normal is not necessarily optimal. Diagnosis, as the initial step of this medical practice, is one of the most important parts of complicated clinical decision making [1].

With the growth of Electronic Health Records (EHR) in the biomedical and healthcare communities, it has become possible to use bedside computer-aided diagnosis for accurate analysis of medical data, which can greatly benefit ICU disease diagnosis as well as patient care and community services. However, existing work has focused on specialized predictive models that cover a limited set of diseases. For example, Long et al. use the IT2FLS model to diagnose heart disease [17], Jiri Polivka Jr. et al. investigate brain metastatic disease [22], Chaurasia et al. [4] use data mining techniques to detect breast cancer, and Nilashi et al. [20] use a neuro-fuzzy technique for hepatitis diagnosis. Day-to-day clinical practice, in contrast, involves an unscheduled and heterogeneous mix of scenarios and would need hundreds to thousands of different prediction models [7]. It is impractical to develop and deploy specialized models one by one.

Fig. 1. Complication distribution of patients in MIMIC-III.

Figure 1 shows the complication distribution of patients in the Medical Information Mart for Intensive Care (MIMIC-III) [12]. The vast majority of ICU patients are diagnosed with more than one disease; most of them have 5 to 20 complications. Moreover, the human body is an organic entity whose systems are closely connected, so no disease is isolated. With this in mind, and to establish a single model that diagnoses the majority of diseases, we designed a multi-source multi-task attention [30] model for ICU diagnosis. The sources refer to the clinical measurements and the medical treatments, and the tasks refer to the diagnoses of different diseases; a detailed description is given in the problem statement (Sect. 3.1). To the best of our knowledge, this is the first work to utilize the feature space shared among different diseases to boost diagnostic performance.

The focus of this paper is on diagnosis as a process: we place diagnosis in a temporal sequence and treat it as a step-by-step process, in particular from the perspective of streaming EHR data. We conduct our experiments on the real-world MIMIC-III benchmark dataset, and the results show that our model is highly competitive and outperforms state-of-the-art traditional methods and commonly used deep learning methods. Furthermore, we evaluate our model on 50 different diseases across 9 human body systems.

The main contributions of this work are summarized as follows:

  • Multiple Perspectives for Disease Formulation. We formulate ICU disease diagnosis as a multi-source and multi-task learning problem, where the sources correspond to clinical measurements and medical treatments, and the tasks correspond to the diagnosis of each disease. This formulation enables us to use one straightforward model to handle different kinds of diseases across all categories.

  • Diagnosis Step by Step. For the first time, we treat disease diagnosis as a gradual process over the temporal sequence of measurements and treatments as well as the complications.

  • A Novel Integrated Model to Diagnose the Majority of Diseases. We designed a model, DMMAM, that integrates input embedding, window alignment, attention mechanisms, and the focal loss function.

  • Comprehensive Experimental Evaluation. We conduct experiments on the MIMIC-III benchmark dataset on 50 diseases over 9 categories, which cover most of the common diseases. The results demonstrate that our method is effective and competitive and can achieve state-of-the-art performance.

The remainder of this paper is organized as follows. Section 2 briefly reviews recent advances in disease diagnosis. Section 3 gives the detailed problem definition and our proposed framework. Section 4 presents our experiments and discussion. Section 5 concludes this study and outlines future work.

2 Related Work

Diagnosis is the traditional basis for decision-making in clinical practice, and inferring the disease from observations has attracted increasing attention in recent years [7, 17, 22, 25, 31]. Existing disease prediction methods can be roughly divided into two categories: clinical-based diagnosis [9, 22, 25] and data-based diagnosis [7, 17, 31]. Most clinical-based methods require profound medical knowledge and focus on a particular field, for example specific diseases caused by specific germs [21]. Until the last few years, most techniques for computer-aided disease diagnosis were based on traditional machine learning and statistical techniques such as logistic regression (LR), support vector machines (SVM) [27], random forests (RF) [19], and decision trees (DT) [2, 11, 24]. Recently, deep learning techniques have achieved great success in many domains through deep hierarchical feature construction and by capturing long-range dependencies effectively [10]. Given the rising popularity of deep learning and the increasingly vast amount of clinical electronic data, there has also been an increase in the number of publications applying deep learning to disease diagnosis tasks [5, 6, 7, 20], which yield better performance than traditional methods and require less time-consuming preprocessing and feature engineering. For instance, Zhenping et al. [5] use the Best Mimic Model for ICU outcome prediction and report an Area Under the Receiver Operating Characteristic curve (AUROC) on average 0.1 higher than SVM, LR, and DT, and Zachary C. Lipton et al. learned to diagnose with long short-term memory (LSTM) recurrent neural networks and obtained an average F1 score of 0.5981 over 6 different diseases.

However, all these methods are designed for a specific disease, relying either on intensive use of domain-specific knowledge or on advanced statistical methods. Specifically, studies have been conducted on Alzheimer’s disease [31], heart disease [17], chronic kidney disease [28], and abdominal aortic aneurysm [13]. These models have been developed to anticipate particular needs and predict only a limited set of diseases, whereas day-to-day clinical practice involves an unscheduled and heterogeneous mix of scenarios and would need hundreds to thousands of different prediction models. It is impractical to develop and deploy specialized models one by one [7], so it is important to develop a unified model that applies to the majority of diseases. This dovetails neatly with multi-task learning, in which each disease can be treated as a single learning task. Note that many approaches to multi-task learning (MTL) in the literature deal with a similar setting: they assume that all tasks are associated with a single output, e.g., the multi-class MNIST dataset is typically cast as 10 binary classification tasks. More recent approaches deal with a more realistic, heterogeneous setting where each task corresponds to a unique set of outputs [23]. We cannot simply apply these approaches to our problem, because our multiple clinical observations and multiple medical treatments cannot be integrated into the existing frameworks.

More importantly, the human body is an organic entity whose systems are intimately connected, and no disease is isolated, so there may be little difference between complications. Consequently, our experiments show that it is hard to apply traditional methods to such a large dataset covering 50 kinds of diseases.

Motivated by the above problems, in this paper we propose a general methodology, namely the Deep Multi-source Multi-task Attention Model (DMMAM), to predict diseases jointly from multi-modal data. Here the sources are the clinical measurements and the medical treatments, and the tasks are the diagnoses of the diseases. In our work, the variables include not only continuous clinical variables for regression (step-by-step time-series regression) but also categorical variables for classification (i.e., the class labels for disease classification). We treat the estimation of different diseases as different tasks and adopt the multi-task learning method [31] developed in the machine learning community for joint learning. Multi-task learning effectively increases the sample size used to train the model, which matters because the samples of some diseases are too few for learning on their own (see Table 1). Specifically, we first assume that related tasks share a common relevant feature subset, such as age, temperature, heartbeat, and blood pressure, but with a varying amount of influence on each task, and thus adopt a hand-engineered feature selection method to obtain a common feature subset for the different tasks simultaneously. Then, we use window alignment to adjust the time window between different sources and one dense layer to reduce the dimensionality. In addition, we use two attention layers to capture the correlations between the different input sources as well as between time steps. Finally, we use a gated recurrent unit (GRU) to fuse the selected features from each modality to estimate multiple regression and classification variables.

We detail the problem definition and our proposed method in Sect. 3.

3 Proposed Framework

3.1 Problem Statement

For a given ICU stay of T hours and a collection of diagnostic results \(R_t, t\in T \), we assume that we have a series of clinical observations:

$$\begin{aligned} O(t) = {\left\{ \begin{array}{ll} R_t, &{} \text {if } R_t \ne \emptyset \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

where O(t) is the vector of bedside observations at time t. \( O(t)=P_a^i \varTheta Q_b^j\), where \(P_a^i\) represents the i-th clinical measurement at time a, \(Q_b^j\) represents the j-th medical treatment at time b, \(\varTheta \) is a window alignment operation between \(P_a^i\) and \(Q_b^j\), and \(R_t\) represents the diagnostic result at time t. Our objective is to generate a disease prediction at each step of the sequence. The type of prediction depends on the specific task and can be denoted as a discrete vector \(R_t^i\) for multi-task classification. As all tasks are at least somewhat noisy, when training a model for \(Task_i\) we aim to learn a representation that ideally ignores the data-dependent noise and generalizes well. By sharing representations between related tasks, we enable our model to generalize better on each original task.
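
As a minimal sketch of the zero-filling in Eq. (1), the following Python snippet builds an hourly observation sequence for one ICU stay. The function name, the dictionary keyed by hour, and the two-feature vectors are illustrative assumptions and are not taken from the original pipeline.

```python
from typing import Dict, List


def build_observation_sequence(
    records: Dict[int, List[float]], stay_hours: int, dim: int
) -> List[List[float]]:
    """Return O(0), ..., O(T-1) for a stay of T = stay_hours hours."""
    sequence = []
    for t in range(stay_hours):
        vector = records.get(t)
        if vector:
            # Something was recorded at hour t: keep it as the observation.
            sequence.append(list(vector))
        else:
            # Nothing recorded at hour t: fall back to a zero vector.
            sequence.append([0.0] * dim)
    return sequence


# Hypothetical stay of 4 hours with records at hours 0 and 2 only.
records = {0: [36.6, 80.0], 2: [37.1, 92.0]}
sequence = build_observation_sequence(records, stay_hours=4, dim=2)
```

Hours without any record are represented by zeros, so every stay yields a sequence of fixed-width vectors regardless of how sparsely it was charted.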

3.2 Multi-modal Multi-task Temporal Learning Framework for Temporal Data

Inspired by the work of Daoqiang Zhang and Dinggang Shen [31], we treat the diagnosis of diseases as a sequential multi-modal multi-task (SM3T) learning problem. The modalities are the clinical measurements and the medical treatments, and the tasks are the diagnoses. The framework can simultaneously learn multiple tasks from multi-modal temporal data. Figure 2 illustrates the proposed SM3T method and compares it with existing learning methods.

Fig. 2. Multi-modal multi-task temporal learning framework for temporal data.

Figure 2(a) shows single-modality single-task temporal learning, the most commonly used setting: each subject has only one modality of data, represented as \(x_i\) at each time step, and corresponds to only one task, denoted as \(Y_i\). Figure 2(b) shows single-modality multi-task temporal learning: the input is the same as in the single-task case, but each subject corresponds to multiple tasks, denoted as \(Y_i^1, Y_i^2, Y_i^3, ..., Y_i^n, n>1\). Figure 2(c) shows multi-modality single-task temporal learning: each subject has multiple modalities of data, represented as \(x_i^1, x_i^2, x_i^3, ..., x_i^n, n>1\) at each time step, and corresponds to only one task, denoted as \(Y_i\). Figure 2(d) shows multi-modality multi-task temporal learning: each subject has multiple modalities of data, represented as \(x_i^1, x_i^2, x_i^3, ..., x_i^n, n>1\) at each time step, and corresponds to multiple tasks, denoted as \(Y_i^1, Y_i^2, Y_i^3, ..., Y_i^n, n>1\).

Following Zhang et al. [31], we formally define SM3T learning as follows. Given N training subjects over a time span T, each having M modalities of data represented as:

$$\begin{aligned} x_i^t=\{x_i^t(1), x_i^t(2), \dots , x_i^t(m), \dots , x_i^t(M)\}, i=1,2, \dots , N \end{aligned}$$
(2)

our SM3T method jointly learns a series of models corresponding to Y different tasks, denoted as:

$$\begin{aligned} Y_i=\{y_i^t(1), y_i^t(2), \dots , y_i^t(j), \dots , y_i^t(Y)\}, j=1,2, \dots , Y \end{aligned}$$
(3)

Note that SM3T is a general learning framework; here we implement it through an attention framework, as shown in Fig. 3. The x-axis represents the sequential data stream at time t, the y-axis represents the actions conducted at each time point t, and the z-axis represents the modalities of the input sources. In our experiment, \(M=2\) modalities (\(S1=\) clinical measurements and \(S2=\) medical treatments) are used for jointly learning models corresponding to different tasks. We detail the inner workings of the SM3T framework in the following sections.
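
To make the notation of Eqs. (2) and (3) concrete, the sketch below stores one subject as nested lists: one feature vector per modality per time step, and one 0/1 label per task per time step. The class name `Sm3tSubject`, its field layout, and the toy values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sm3tSubject:
    # inputs[t][m] is the feature vector of modality m at time step t (Eq. 2).
    inputs: List[List[List[float]]]
    # labels[t][j] is the 0/1 label of task j at time step t (Eq. 3).
    labels: List[List[int]]

    def validate(self, num_modalities: int, num_tasks: int) -> None:
        """Raise ValueError if the subject does not match the declared shape."""
        if len(self.inputs) != len(self.labels):
            raise ValueError("inputs and labels must cover the same time steps")
        for step_inputs, step_labels in zip(self.inputs, self.labels):
            if len(step_inputs) != num_modalities:
                raise ValueError("every step needs one vector per modality")
            if len(step_labels) != num_tasks:
                raise ValueError("every step needs one label per task")


# Two time steps, two modalities (measurements, treatments), three tasks.
subject = Sm3tSubject(
    inputs=[[[36.6, 80.0], [1.0]], [[37.1, 92.0], [0.0]]],
    labels=[[0, 0, 1], [1, 0, 1]],
)
subject.validate(num_modalities=2, num_tasks=3)
```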

Fig. 3. The proposed multi-source multi-task attention model.

3.3 Input Embedding and Window Alignment

Given the R actions at each step t, the first step in our model is to generate an embedding that captures the dependencies across different diseases without the temporal information. In the embedding step, let N denote the number of diseases; the diagnosis process is first designed for each disease without temporal information. Let P denote the set of ICU patients. The \(p\)-th patient has h diagnosis results at time t, and the \(p\)-th patient with the \(h\)-th disease is associated with two feature vectors \(Sa_p^h(t)\) and \(Sb_p^h(t)\) derived from the EHR, where \(Sa_p^h(t)\) denotes the clinical measurements and \(Sb_p^h(t)\) denotes the medical treatments. The dimensions of Sa and Sb are \(\alpha \) and \(\beta \), respectively. Combining Sa and Sb, we generate a new feature vector \(\varPhi ^p\) for the \(p\)-th patient:

$$\begin{aligned} \varPhi ^p \equiv [\phi _1^p(t), \phi _2^p(t), \dots , \phi _h^p(t)] \end{aligned}$$
(4)
$$\begin{aligned} \phi _h^p(t) = \lambda ^h_1 Sa_p^h(t) \star \lambda ^h_2 Sb_p^h(t) \end{aligned}$$
(5)

where \(\star \) is the window alignment operation, and \(\lambda _1\) and \(\lambda _2\) are trainable parameters for each disease.

Our framework contains multiple kinds of actions: medical treatments Sb and clinical measurements Sa. We add a window alignment operation because, according to common medical knowledge, the effect of a treatment usually shows up in the measurements with some delay. Assume \(Sa_p^h(ti)\) represents the clinical measurements at time step ti and \(Sb_p^h(tj)\) represents the medical treatments at time step tj. The alignment maps \(Sa_p^h(ti)\) and \(Sb_p^h(tj)\) onto a single time step \(S_p^h(t)\). The alignment parameters \(\lambda _i^h\) are learned per patient and per disease. We found that tj is usually later than ti, which accords well with prevailing medical knowledge.
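
The alignment idea can be illustrated with a simplified sketch. It assumes a fixed integer lag between the two sources and combines the weighted vectors by concatenation; both choices are assumptions made here for illustration, since the model learns its alignment parameters and the exact form of the \(\star \) operation is not spelled out in this section. The function name and arguments are hypothetical.

```python
from typing import List


def align_windows(
    measurements: List[List[float]],
    treatments: List[List[float]],
    lag: int,
    lam1: float = 1.0,
    lam2: float = 1.0,
) -> List[List[float]]:
    """Pair the measurement at step ti with the treatment at step tj = ti + lag."""
    treat_dim = len(treatments[0])
    aligned = []
    for ti, measure in enumerate(measurements):
        tj = ti + lag
        if 0 <= tj < len(treatments):
            treat = treatments[tj]
        else:
            # No treatment step falls inside the window: pad with zeros.
            treat = [0.0] * treat_dim
        # Weight each source and merge both into one vector for this step.
        aligned.append([lam1 * v for v in measure] + [lam2 * v for v in treat])
    return aligned


measurements = [[1.0], [2.0], [3.0]]
treatments = [[10.0], [20.0], [30.0]]
aligned = align_windows(measurements, treatments, lag=1, lam2=0.5)
```

With `lag=1`, the measurement at step 0 is merged with the treatment at step 1, and the last measurement is padded because no later treatment step exists.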

3.4 Dense Layer

To balance computational cost against predictive performance, we reduce the dimensionality before passing the raw medical data to the next processing step. The typical way is to concatenate an embedding at every step in the sequence. However, because the clinical features are high-dimensional, this yields a “cursed” representation that is not suitable for learning and inference. Inspired by Trask’s work [29] in Natural Language Processing (NLP) and Song’s work [26] in clinical data processing, we add a dense layer to unify and flatten the input features. To prevent overfitting, we set dropout = 0.38 here.
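
A minimal sketch of this step, assuming a plain linear projection followed by inverted dropout at the stated rate of 0.38. The weights, the bias, and the four-feature input are made-up values; the activation and the layer width actually used in the model are not specified here.

```python
import random
from typing import List


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Linear projection: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def dropout(x: List[float], rate: float, training: bool, rng: random.Random) -> List[float]:
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


rng = random.Random(0)
weights = [[0.5, -0.5, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # 4 features -> 2 units
bias = [0.0, 0.1]
features = [1.0, 2.0, 3.0, 4.0]

projected = dense(features, weights, bias)
train_out = dropout(projected, rate=0.38, training=True, rng=rng)
eval_out = dropout(projected, rate=0.38, training=False, rng=rng)
```

Dropout is active only during training; at inference time the projected features pass through unchanged.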

3.5 The Gated Recurrent Unit Layer

The gated recurrent unit (GRU) layer takes the sequence of actions \(\{x_t\}_{t\ge 1}^T\) from the previous dense layer and associates the \(p\)-th patient with a class label vector Y over the time span, where \(Y_p^n(t)\) denotes the class label for the \(p\)-th patient with the \(n\)-th disease at time t. \(Y_p^n(t)\) is set as follows:

$$\begin{aligned} Y_p^n(t)= {\left\{ \begin{array}{ll} \text {disease ID}, &{} \text {if diagnosis recorded at time } t \\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(6)

We create a T-dimensional response vector for the \(p\)-th patient:

$$\begin{aligned} Y^{(p)} =(y_{p,1}, y_{p,2}, \dots , y_{p,p_t} )^\top \end{aligned}$$
(7)

For the diagnosis of ICU patients, we adopt the GRU and represent the posterior probability that patient p has the \(n\)-th disease as:

$$\begin{aligned} Pr[Y_p^n(t)=1|\phi _h^p(t)] = \sigma ({\omega ^{(p)}}^{\top }\phi _h^p(t)) \end{aligned}$$
(8)

where \(\sigma (a)\equiv {{(1+\exp (-a))}^{-1}}\) is the sigmoid function and \({{\omega }^{(p)}}\) is an \((\alpha +\beta )\)-dimensional model parameter vector for the \(p\)-th patient.

To learn the mutual information in the data, we model all diseases jointly, so that the same vector space is shared across diseases; this is particularly useful for diseases with few samples. We represent the trainable parameters of the GRU as an \((\alpha +\beta ) \times T\) matrix \(W\equiv [{\omega ^1},{{\omega }^{2}},\cdots ,{{\omega }^{T}}]\).
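
For readers unfamiliar with the GRU, the sketch below implements one step of a standard GRU cell (update gate, reset gate, candidate state) and a sigmoid read-out in the spirit of Eq. (8). Biases are omitted, the gate convention is one of the two common ones, and applying the read-out to the hidden state is an assumption; the parameter names and toy values are hypothetical.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x: List[float], h: List[float], params: Dict[str, Matrix]) -> List[float]:
    """One GRU update without biases: returns the next hidden state."""
    z = [sigmoid(a + b) for a, b in zip(matvec(params["Wz"], x), matvec(params["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(params["Wr"], x), matvec(params["Ur"], h))]
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [
        math.tanh(a + b)
        for a, b in zip(matvec(params["Wh"], x), matvec(params["Uh"], gated_h))
    ]
    # Blend the previous state and the candidate state with the update gate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def disease_probability(w: List[float], h: List[float]) -> float:
    """Sigmoid of a linear score, as in Eq. (8)."""
    return sigmoid(sum(a * b for a, b in zip(w, h)))


def constant_matrix(rows: int, cols: int, value: float) -> Matrix:
    return [[value] * cols for _ in range(rows)]


# Toy setup: 3 input features, 2 hidden units, two time steps.
params = {
    "Wz": constant_matrix(2, 3, 0.1), "Uz": constant_matrix(2, 2, 0.1),
    "Wr": constant_matrix(2, 3, 0.1), "Ur": constant_matrix(2, 2, 0.1),
    "Wh": constant_matrix(2, 3, 0.1), "Uh": constant_matrix(2, 2, 0.1),
}
hidden = [0.0, 0.0]
for x_t in [[1.0, 0.0, 0.5], [0.2, 0.3, 0.1]]:
    hidden = gru_step(x_t, hidden, params)
probability = disease_probability([0.5, -0.5], hidden)
```

Because the hidden state is carried from one step to the next, a prediction can be read out at every time step rather than only at the end of the stay.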

3.6 Multi-head Attention and Feed Forward

This attention layer is designed to capture the dependencies across the whole sequence, since we treat diagnosis as a step-by-step process. In the ICU scenario, the actions (clinical measurements and medical treatments) closer to the current position are critical for the diagnosis, whereas observations further away are less critical. Therefore, we should weight information differently depending on the position at which the observation is made.

Inspired by [30], we use H-head attention to create multiple attention graphs; the resulting weighted representations are concatenated and linearly projected to obtain the final representation. We also add 1D convolutional sub-layers with kernel size 2: internally, we use two of these sub-layers with a ReLU (rectified linear unit) activation in between, and residual connections are used around them. Unlike previous work [1, 4, 7, 11], which makes the diagnosis only once after a specific timestamp, we output a prediction at each timestamp. This is because the diagnosis may change during the ICU stay, so we treat it as a dynamic procedure. This is more helpful for ICU clinicians because they need to know the patient’s possible diseases at any time rather than only at a particular time. We stack the attention module N times and use the final representation in the final model. Moreover, this attention layer is task-wise; that is, the attention takes effect only when it is helpful to the diagnosis.
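
The core of the attention module can be sketched as scaled dot-product self-attention split over several heads. This simplified version omits the learned query, key, value, and output projections as well as the convolutional sub-layers and residual connections, so it illustrates the mechanism rather than reproducing the layer used in the model; all names and values are illustrative.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(
    queries: List[List[float]], keys: List[List[float]], values: List[List[float]]
) -> List[List[float]]:
    """Scaled dot-product attention over a sequence of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs


def multi_head_self_attention(sequence: List[List[float]], num_heads: int) -> List[List[float]]:
    """Split features into heads, attend within each head, concatenate the results."""
    dim = len(sequence[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = [step[h * head_dim:(h + 1) * head_dim] for step in sequence]
        heads.append(attention(part, part, part))
    return [[v for head in heads for v in head[t]] for t in range(len(sequence))]


# Three time steps with four features each, attended with two heads.
sequence = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
attended = multi_head_self_attention(sequence, num_heads=2)
```

Each output step is a weighted average of all steps in the sequence, so the representation at one timestamp can draw on measurements and treatments recorded at other timestamps.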

3.7 Linear and Softmax Layers

The linear layer is designed to obtain the logits from the unified output of the attention layer. The activation function used in this layer is ReLU. The last layer prepares the output for the different tasks. We use softmax to classify the different diseases, and the loss function is:

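$$\begin{aligned} Loss\_d=-\frac{1}{N}\sum \limits _{n=1}^{N}{\left( {{y}_{n}} \log ({{\overline{y}}_{n}})+(1-{{y}_{n}})\log (1-{{\overline{y}}_{n}})\right) } \end{aligned}$$
(9)

where N denotes the number of diseases, \(y_n\) is the ground-truth label, and \(\overline{y}_n\) is the predicted probability for the \(n\)-th disease. Because of the imbalanced distribution of the training set, we also introduce the focal loss [16] as our loss function.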
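
The relation between the two losses can be sketched as follows, using the standard binary cross-entropy and the focal loss without its optional class-weighting factor. The focusing parameter `gamma=2.0` is a commonly used default and is an assumption here, as the value used in the experiments is not stated; the labels and probabilities are made up.

```python
import math
from typing import List


def binary_cross_entropy(y_true: List[int], y_pred: List[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over the labels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def focal_loss(
    y_true: List[int], y_pred: List[float], gamma: float = 2.0, eps: float = 1e-7
) -> float:
    """Mean focal loss: cross-entropy down-weighted for well-classified examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(y_true)


labels = [1, 0, 1]
predictions = [0.9, 0.2, 0.6]
ce = binary_cross_entropy(labels, predictions)
fl = focal_loss(labels, predictions, gamma=2.0)
```

With `gamma=0` the focal loss reduces to the cross-entropy; larger values shrink the contribution of confidently correct predictions, which shifts the training signal toward hard and under-represented diseases.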

Table 1. Description of the prediction tasks based on ICD 9 code.

4 Experiment

4.1 Data Description

We use a real-world dataset, MIMIC-III, to evaluate our proposed approach. MIMIC-III is a large, publicly available database comprising de-identified health-related data associated with approximately sixty thousand admissions of patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The open nature of the data allows clinical studies to be reproduced and improved in ways that would not otherwise be possible [12]. In our experiment, we treat each ICU stay as a single case, because different ICU stays of the same patient may be diagnosed with different diseases. This also gives us more samples to train on. As shown in Table 1, this is the first time that disease diagnosis has been conducted over such a large number of categories. We categorize the dataset based on the International Classification of Diseases code, ICD-9, and select 151729 ICU admissions over 50 commonly diagnosed diseases. As shown in Fig. 1, most patients have multiple complications, and we collected all the complications over the whole ICU stay in temporal order. Unlike previous work, we did not filter out any patients, which may result in lower performance compared with related work. For the features, we included 529 clinical measurement features and 330 medical treatment features. Because the training samples are abundant and messy, performance differs greatly between diseases.

Table 2. Experiment settings for training, validating and test.
Table 3. Performance evaluation on each diagnose task.

4.2 Experiment Settings

Our experiment includes over 40000 patients and 50 kinds of diseases in 9 categories, with ICD-9 codes ranging from 001 to 779. To score a diagnosis, we set the outcome to “true” if the prediction is correct within the time span in which the disease was observed, and “false” otherwise. During training, we output a prediction at a time step only if there are observations during that time step, whereas during testing we can output a diagnosis at every time step, and the time span can be customized. The learning rate in this experiment is 0.001, and the number of epochs is 30. We set the batch size to 32, use the Adam optimizer, and set dropout = 0.35. According to our experiments, we obtain the best performance in most cases when the attention module is stacked 4 times. To conduct all experiments on the same data, we manually divide the dataset into training, validation, and test sets, as listed in Table 2.
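
For reproducibility, the settings above can be collected in a single configuration object. The sketch below is one possible layout; the class and field names are hypothetical, while the values are the ones stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float = 0.001
    epochs: int = 30
    batch_size: int = 32
    optimizer: str = "adam"
    dropout: float = 0.35
    attention_stack: int = 4  # number of stacked attention modules


config = TrainingConfig()
```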

4.3 Compared Methods

We compared our proposed method with 6 commonly used methods, i.e., Logistic Regression (LR) with L2 regularization, Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), GRU, and the state-of-the-art LSTM-based method [15]. Due to the page limit, we list only the two best of these methods in this paper. The first is Logistic Regression (LR) with L2 regularization, and the second is the state-of-the-art LSTM-based method, listed as LSTM+ in Table 3. As mentioned above, to ensure that every method is evaluated on the same data, we fixed the dataset split. As shown in Table 2, the validation and test data we use account for approximately \(25\%\) of the whole dataset.

4.4 Evaluation Metric

To compare the techniques above, three evaluation metrics were used in this research: F1-Measure, Accuracy, and Recall. They are defined as:

$$\begin{aligned} \text {Accuracy} = \frac{TP+TN}{TP+FP+TN+FN} \, \, \,\, \, \, \text {Recall}= \frac{TP}{TP+FN} \end{aligned}$$
(10)
$$\begin{aligned} \text {F1-Measure}= \frac{2 \times \text {Precision} \times \text {Recall} }{\text {Precision} + \text {Recall}} \end{aligned}$$
(11)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively, and \(\text {Precision}=TP/(TP+FP)\).
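
As a worked example of these metrics, the sketch below computes them from confusion-matrix counts, using the usual definition of precision as TP / (TP + FP). The function name and the counts are made up for illustration.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts for one disease.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

For these counts the accuracy is 0.85, the recall is 40/45, the precision is 0.8, and the F1-Measure is 16/19, which shows how F1 sits between precision and recall when the two differ.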

4.5 Experiment Results and Discussions

Table 3 shows the prediction results. Our model clearly outperforms all the baseline methods. Because we did not filter out any ICU admissions and included all categories of diseases, some evaluation metrics in our experiment are lower than the results reported in Chen et al.’s work [15] (marked as LSTM+ in Table 3), but under the same experimental settings our model always achieves the best performance. We can also see that the number of samples greatly affects diagnosis performance: the more samples, the better the performance.

We found that the differences among categories are more evident than those among diseases in the same category, and can exceed 3.2% in accuracy on average. The diseases in category 3 (Endocrine, Nutritional, Metabolic and Immunity) are the hardest to diagnose with our model, and the diseases in category 9 (Conditions originating in the perinatal period) are the easiest. This is because category 9 differs more from the other categories, whereas category 3 differs less from them. In addition, the smaller performance differences among diseases in the same category indicate that diseases within the same system are more closely related. We also conducted ablation studies on the diagnosis process, and the results show that the multi-source and multi-task design improves performance across all tasks by over 3.6 percent in F1 score. That is to say, by sharing the contextual feature space in the hidden layers, DMMAM can significantly improve performance.

5 Conclusion and Future Work

In this study, we presented a new model named DMMAM for disease diagnosis in the ICU setting. We modeled ICU disease diagnosis as a multi-source multi-task classification problem and treated diagnosis as a gradual process along the clinical measurements and the medical treatments. The significance of our proposed model can be summarized as follows:

  1. We considered the diversity of complications. This accords with the medical reality that no disease is isolated and that different diseases have different diagnostic criteria and treatment methods; the proposed multi-source multi-task model is well suited to this situation.

  2. We considered the sequential relationship in diagnosis. By introducing the attention layer, we simulated the clinicians’ diagnosis process and captured the interaction information along the sequence.

  3. We addressed the imbalance problem. The number of samples varies hugely among diseases in the training data. For example, unspecified essential hypertension has 23153 samples, whereas secondary malignant neoplasm of the lung has only 866 samples. If we learned to diagnose without any precautionary measures, the diagnosis results would be biased toward the majority diseases. By using the focal loss function, we alleviated the problem caused by the imbalance of the dataset in the training process.

We conducted our experiment on 50 diseases over 167884 samples, and the results show the robustness and high accuracy of our model. Moreover, this is the first time that diagnosis has been conducted on such a large dataset. Nevertheless, how to use these diagnoses in further clinical actions remains a challenge in scientific research, and future work can focus on this problem.