
1 Introduction

Electronic health record (EHR) systems capture vast amounts of digital data about patients’ health status, their medical and treatment histories, and clinical outcomes. These data provide opportunities to improve healthcare delivery, reduce medical costs and, when integrated with genomic and imaging data, can enable the development of strategies for personalized medicine. However, the use of EHR data in medical research and software development is often impeded by the complexities of regulatory oversight. Because electronic health records contain patients’ information, data access and sharing are strictly controlled by rules and processes that protect patient privacy. Obtaining approval for access to de-identified clinical data can be time-consuming. Approvals are generally granted for specific subsets of data (new approvals are required if a study later needs additional data subsets), with limits on how the data can be shared within and among research teams. Higher security requirements on computing and storage infrastructure place an additional burden on EHR-based medical and informatics research. The process of data de-identification is also time-consuming and expensive, especially for large EHR datasets. Moreover, de-identified data can still pose privacy and security risks [1].

Realistic synthetic datasets that maintain the statistical properties of real datasets can mitigate the complexities of clinical data access by eliminating (or significantly reducing) privacy and security risks, and they can complement de-identified real clinical data in informatics and medical research [2,3,4,5,6,7,8,9,10,11]. For instance, synthetic datasets can be used for data analysis [3] and cohort identification tasks [2]. They can also replace or augment real data for more efficient development and evaluation of computerized analysis methods [4, 5]. Realistic synthetic EHR data can, in particular, benefit deep learning analysis workflows, which often require large volumes of data to train accurate and robust models. Large longitudinal EHR datasets, for example, are critical to the development of reliable predictive models, which are generally based on recurrent neural networks (RNNs), such as long short-term memory [12] (LSTM) architectures. However, there are challenges in generating realistic synthetic datasets. Data heterogeneity, large numbers of data elements and types, irregularities in data, and missing values make it arduous to implement efficient methods that can produce realistic synthetic data.

We propose a novel deep learning method, Longitudinal GAN (LongGAN), for generating longitudinal synthetic EHR data. A trained LongGAN model generates high-quality clinical data containing continuous laboratory and medication values for given diseases over a 72-hour period. It can be applied to any continuous-valued longitudinal data over any reasonable time range.

Deep learning has in recent years become the preferred method for data analysis in a wide range of applications, including analysis of clinical data for identification of disease risk, outcome prediction, and the extraction and classification of clinical information. For example, deep learning methods have been used to analyze EHR data to identify the risk of opioid use disorder [13] and opioid overdose [14] in population studies and to detect miscoded diabetes diagnosis codes for quality improvement [15]. Deep learning methods have also been successfully applied to synthetic data generation in many application domains, such as text-to-image synthesis [16], video generation [17], and music generation [18]. Most synthetic data methods employ the Generative Adversarial Network [19] (GAN) architecture, which consists of a generator component and a discriminator (or a critic) component. The generator produces synthetic data, whereas the discriminator distinguishes between real and synthetic data. The adversarial relationship between the generator and the discriminator forces the generator to learn to produce realistic synthetic data. Several recent projects have employed GANs for synthetic EHR data generation [20,21,22,23,24,25,26,27]. Medical Generative Adversarial Network (MedGAN) [20] implements a method for generating discrete data elements (medication codes and diagnosis codes). SMOOTH-GAN [21] demonstrated that GANs generate more realistic snapshots of laboratory values and medications when binary labels are converted to continuous values using imperfect machine learning models as heuristic functions. However, most of the previous efforts have focused on producing non-longitudinal synthetic data that represent a snapshot of a patient’s medical history. Applications of GANs to synthetic time-series clinical data remain scarce, mainly because generating sequences requires the generated data to have not only a similar overall distribution of attributes but also temporal dynamics similar to those of the real sequences. Some recent efforts have resulted in methods for generating longitudinal synthetic data. Recurrent Conditional Generative Adversarial Networks (RCGAN) [22] used an RNN architecture for both the generator and the discriminator and took conditional input at each time step. The authors evaluated its performance on the eICU Collaborative Research Database with four selected regularly sampled features. Time-series Generative Adversarial Networks (TimeGAN) [23] introduced a supervised loss to enforce preservation of temporal dynamics and trained the generator and the discriminator in an embedded space. The authors of TimeGAN measured its success on a discrete-valued lung cancer dataset. Dual Adversarial Autoencoder (DAAE) [24] made use of an inner GAN and an outer GAN to learn set-valued sequences of medical entities such as diagnosis codes.

LongGAN takes advantage of recurrent autoencoders and the Wasserstein Generative Adversarial Network with Gradient Penalty [28] (WGAN-GP) architecture. Recurrent autoencoders have been successfully applied to multivariate time-series analysis tasks such as forecasting [29] and anomaly detection [30]. They can learn useful representations of sequences while preserving temporal dynamics during reconstruction. LongGAN leverages this property of recurrent autoencoders and adapts it to train an autoencoder model to generate realistic sequences. Our work differs from previous work as follows: 1) unlike regularly sampled bedside data, our data are irregularly sampled with many missing values; 2) our data contain many features, rather than only a few handcrafted features; 3) conditional inputs (disease labels) are incorporated to generate realistic longitudinal data.

We evaluated the performance of LongGAN by training a logistic regression model, a random forest model, and a two-layer long short-term memory (LSTM) network model to predict acute kidney injury (AKI). These models represent examples of linear models, nonlinear models, and deep learning models, respectively. The experimental results show that predictive models trained with synthetic data from LongGAN achieve Area Under the Receiver Operating Characteristic curve (AUROC) and Area Under the Precision-Recall Curve (AUPRC) values comparable to those of models trained with real data. In addition, synthetic datasets from LongGAN lead to much better models, with up to 0.27 higher AUROC and up to 0.21 higher AUPRC values, compared with synthetic data from RCGAN and TimeGAN, the two most relevant GAN-based methods for synthetic longitudinal data generation.

Beyond the realism of synthetic data, a key concern is protecting patient privacy (i.e., an attacker should not be able to discover the identities of patients from a synthetic dataset). We examined this aspect of LongGAN in the context of attribute disclosure attacks. The experimental results show that an attacker who has a subset of attributes from the real dataset could achieve a mean accuracy of 20% in predicting missing attributes with k-nearest neighbors (KNN) estimation using the synthetic dataset generated by LongGAN. This value is lower than the mean accuracy of 26% that the attacker could achieve without access to the synthetic dataset, using a population-median method.

2 Methods

2.1 Architecture of LongGAN

The proposed method consists of a recurrent autoencoder network and a GAN network, as shown in Fig. 1. A recurrent autoencoder is a neural network trained to copy its input sequence to its output sequence [31]. More specifically, it can be viewed as having two parts: the encoder Enc takes sequential data X and maps it to a dense representation h, and the decoder Dec takes h and tries to reconstruct the input from it. Here, \(X=({s}_{1}, {s}_{2},\dots , {s}_{T} )\) is a time-ordered sequence of vectors. Each vector \({s}_{i}=\left({s}_{i}^{1}, {s}_{i}^{2}, \dots , {s}_{i}^{C}\right), 1\le i\le T\), represents C features at time point \(i\). In our implementation, the encoder and decoder both have three LSTM layers. We aim to minimize the reconstruction loss:

$$ \sum\limits_{X \in D} \left\| X - X^{\prime} \right\|^{2} $$

where D is the dataset, and \(X^{\prime} = Dec\left( h \right) = Dec\left( Enc\left( X \right) \right)\) is the reconstruction of X.
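A minimal Keras sketch of this recurrent autoencoder is shown below. The sequence length T = 72 follows the paper; the feature count C, representation size H, and LSTM widths are illustrative assumptions, not the values used in the study.

```python
# Sketch of the recurrent autoencoder (TensorFlow/Keras); dimensions are assumed.
import tensorflow as tf
from tensorflow.keras import layers, Model

T, C, H = 72, 40, 64  # assumed: 72 hourly steps, C features, H-dimensional representation

# Encoder Enc: three stacked LSTM layers; the last one emits the dense representation h.
enc_in = layers.Input(shape=(T, C))
x = layers.LSTM(128, return_sequences=True)(enc_in)
x = layers.LSTM(128, return_sequences=True)(x)
h = layers.LSTM(H)(x)                                   # h = Enc(X)
encoder = Model(enc_in, h, name="encoder")

# Decoder Dec: repeat h over T steps and reconstruct the C features at each step.
dec_in = layers.Input(shape=(H,))
y = layers.RepeatVector(T)(dec_in)
y = layers.LSTM(128, return_sequences=True)(y)
y = layers.LSTM(128, return_sequences=True)(y)
y = layers.LSTM(128, return_sequences=True)(y)
x_rec = layers.TimeDistributed(layers.Dense(C))(y)      # X' = Dec(h)
decoder = Model(dec_in, x_rec, name="decoder")

# Autoencoder trained end to end with the reconstruction loss ||X - X'||^2.
autoencoder = Model(enc_in, decoder(encoder(enc_in)), name="autoencoder")
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_real, X_real, epochs=..., batch_size=...)
```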

Fig. 1. Architecture of the proposed LongGAN model.

The GAN network is based on the WGAN-GP architecture with conditional inputs. A GAN consists of two components: a generator \(G(z;{\theta }_{g})\) and a discriminator \(D(h;{\theta }_{d})\). The generator takes random noise and tries to generate samples that follow the same distribution as the real data. Meanwhile, the discriminator receives both real and generated data and tries to detect whether a sample is real or fake. Ideally, the optimal generator \({G}^{*}\) would generate samples that are indistinguishable from real samples, and the discriminator would be forced to make a random guess. Conditional GANs [32] (cGANs) are extensions of GANs in which the generator takes not only random noise but also auxiliary information, such as labels, to guide the generation. The objective of a conditional GAN is:

$${\min}_{G}{\max}_{D} V\left({\theta }_{g}, {\theta }_{d}\right)= {E}_{h\sim {P}_{h}}\left[\mathrm{log}\left(D\left(h|y\right)\right)\right]+ {E}_{z\sim {P}_{z}}\left[\mathrm{log}\left(1-D\left(G\left(z|y\right)\right)\right)\right]$$

Here \(h\) is the output of the pre-trained encoder, i.e., the representation of real longitudinal data, \({P}_{h}\) is the distribution of real representations, \({P}_{z}\) is the distribution of random noise (here we used a Gaussian distribution), and y is the conditional input. WGAN-GP is an extension of the basic GAN architecture that improves training stability. Compared to the original GAN, it uses the Wasserstein distance instead of the Jensen-Shannon (JS) divergence, replaces the discriminator with a critic that scores the realness of a given sample, and adds a gradient penalty to enforce a Lipschitz constraint on the critic. The objective function of WGAN-GP is:

$$L={E}_{\tilde{h }\sim {P}_{g}}\left[D\left(\tilde{h }\right)\right]- {E}_{h\sim {P}_{h}}\left[D\left(h\right)\right]+\lambda {E}_{\widehat{h}\sim {P}_{\widehat{h}}}[{({\Vert {\nabla }_{\widehat{h}}D(\widehat{h})\Vert }_{2}-1)}^{2}]$$
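In the standard WGAN-GP formulation, \(\widehat{h}\) is sampled uniformly along straight lines between pairs of real and generated representations. A minimal sketch of this gradient-penalty term is given below, assuming a Keras critic that scores a (representation, condition) pair; the default \(\lambda = 10\) is the value recommended by the WGAN-GP authors, not a setting reported here.

```python
# Sketch of the WGAN-GP gradient-penalty term for a conditional critic.
import tensorflow as tf

def gradient_penalty(critic, h_real, h_fake, y, lam=10.0):
    """lam * E[(||grad_h_hat D(h_hat | y)||_2 - 1)^2], h_hat interpolated between real and fake."""
    eps = tf.random.uniform([tf.shape(h_real)[0], 1], 0.0, 1.0)
    h_hat = eps * h_real + (1.0 - eps) * h_fake         # random interpolation point
    with tf.GradientTape() as tape:
        tape.watch(h_hat)
        score = critic([h_hat, y], training=True)
    grads = tape.gradient(score, h_hat)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1e-12)
    return lam * tf.reduce_mean(tf.square(norm - 1.0))
```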

The synthetic representation \(\tilde{h }\) was fed to the pre-trained decoder to generate synthetic longitudinal data:

$$\tilde{X }=Dec\left(\tilde{h }\right)=Dec(G\left(z|y\right))$$

In our method, the generator has two LeakyReLU hidden layers with α = 0.2, each followed by a batch normalization layer, and a tanh output layer. The critic has two LeakyReLU hidden layers with α = 0.2 and a linear output layer.
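The following Keras sketch illustrates these generator and critic architectures, with the conditional input y concatenated to the noise vector and to the representation, respectively. The hidden widths and the noise dimension Z are assumptions, since the paper does not report them here; H matches the representation size of the autoencoder sketch above.

```python
# Sketch of the conditional generator and critic (Keras); widths and Z are assumed.
from tensorflow.keras import layers, Model

Z = 100  # assumed noise dimension

def build_generator():
    z_in = layers.Input(shape=(Z,))
    y_in = layers.Input(shape=(1,))                     # conditional input (smooth AKI label)
    x = layers.Concatenate()([z_in, y_in])
    for units in (128, 128):                            # two LeakyReLU(0.2) + batch-norm blocks
        x = layers.Dense(units)(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.BatchNormalization()(x)
    h_fake = layers.Dense(H, activation="tanh")(x)      # synthetic representation
    return Model([z_in, y_in], h_fake, name="generator")

def build_critic():
    h_in = layers.Input(shape=(H,))
    y_in = layers.Input(shape=(1,))
    x = layers.Concatenate()([h_in, y_in])
    for units in (128, 128):                            # two LeakyReLU(0.2) layers, no batch norm
        x = layers.Dense(units)(x)
        x = layers.LeakyReLU(0.2)(x)
    score = layers.Dense(1)(x)                          # linear output (Wasserstein critic score)
    return Model([h_in, y_in], score, name="critic")
```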

2.2 Training LongGAN

We extracted inpatient encounter data for adults (18+) from the Cerner Health Facts database [33,34,35], a large multi-institutional de-identified database derived from EHRs and administrative systems. The extracted data were mapped to the OHDSI Common Data Model (version 5.3) and vocabulary release (2/10/2018) [36]. We randomly chose two facilities (131 and 143) from the 10 highest volume inpatient facilities and extracted encounters from 1/1/2016 to 12/31/2017 for the experimental evaluation of the proposed method.

Medications and laboratory tests with at least a 5% appearance rate by encounter in both facilities were extracted, and the raw values were converted to quantiles. We further extracted encounters with a length of stay of at least 72 h and sampled one medication/laboratory test value per hour. If there was more than one measurement in an hour, the median was used. Diagnosis codes were mapped from International Classification of Diseases (ICD) codes to Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). The SNOMED code for Acute Kidney Injury (AKI) and its descendant codes were combined and used as labels for our study.
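A rough pandas sketch of this preprocessing step is shown below. The long-format input layout, the column names, and the rank-based quantile conversion are illustrative assumptions rather than the study's exact pipeline.

```python
# Sketch: quantile conversion and hourly median resampling over the first 72 hours.
import pandas as pd

def to_hourly_grid(events: pd.DataFrame) -> pd.DataFrame:
    # Assumed input: long-format rows [encounter_id, feature, hours_since_admission, value].
    df = events.copy()
    # Convert raw values to quantiles within each medication/laboratory feature.
    df["value"] = df.groupby("feature")["value"].rank(pct=True)
    # Keep the first 72 hours of the stay and bucket measurements into hourly bins.
    df = df[df["hours_since_admission"] < 72]
    df["hour"] = df["hours_since_admission"].astype(int)
    # One value per encounter/feature/hour; take the median when there are several.
    hourly = (df.groupby(["encounter_id", "feature", "hour"])["value"]
                .median()
                .unstack(["feature", "hour"]))          # wide: one row per encounter
    return hourly                                       # still sparse; imputation is handled separately
```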

The extracted datasets were extremely sparse because not all patients have measurements of all medications/laboratory tests every hour, and thus imputation was necessary. There are many approaches to impute time-series data. Here we used the interpolation part of the Interpolation-Prediction Network [37]. The Interpolation-Prediction Network is a semi-parametric network designed for irregularly sampled multivariate time series, taking into account correlations across all time series from different dimensions.

After the preprocessing, we obtained multidimensional longitudinal data for every patient, where each dimension represents the trajectory of a specific medication/laboratory test measurement from the first 72 h of hospitalization. We then trained a classifier to get smooth labels [21, 38] of AKI. More specifically, we trained a random forest model on the training set of real longitudinal EHR data with AKI as labels to assign probabilities of patients’ developing AKI, and then adjusted these probabilities to obtain smooth labels. The adjustment is done as follows:

$$ \mathrm{SmoothLabel}\left( X_{\mathrm{prob}}, X_{\mathrm{label}} \right) = \left\{ \begin{array}{ll} 0.49, & \mathrm{if}\; X_{\mathrm{prob}} > 0.5 \;\mathrm{and}\; X_{\mathrm{label}} = 0 \\ 0.51, & \mathrm{if}\; X_{\mathrm{prob}} < 0.5 \;\mathrm{and}\; X_{\mathrm{label}} = 1 \\ X_{\mathrm{prob}}, & \mathrm{otherwise} \end{array} \right. $$

Here \(X_{\mathrm{prob}}\) is the probability of developing AKI assigned by the trained classifier, and \(X_{\mathrm{label}}\) is the original binary label for AKI.
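The adjustment rule translates directly into a small function, sketched here for illustration:

```python
# The SmoothLabel rule above, written out as a small illustrative function.
def smooth_label(x_prob: float, x_label: int) -> float:
    if x_prob > 0.5 and x_label == 0:
        return 0.49
    if x_prob < 0.5 and x_label == 1:
        return 0.51
    return x_prob
```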

To train a synthetic data generation model, we first pre-trained the encoder and decoder with the real EHR data with reconstruction loss. We then took the output of the encoder, i.e., the representation of the input data, and trained the WGAN-GP to produce synthetic representations. The smooth labels of AKI were used as conditional input for both the generator and the discriminator. Finally, the generated representations were input into the trained decoder to obtain synthetic longitudinal data.
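The sketch below condenses this two-stage procedure, reusing the hypothetical autoencoder, generator, critic, and gradient-penalty pieces from the earlier sketches. The optimizer settings, the number of critic updates per generator update, and the epoch counts are assumptions, not values reported in the paper.

```python
# Sketch of the two-stage LongGAN training: pre-train the autoencoder, then train
# WGAN-GP on the encoded representations, conditioned on smooth AKI labels.
import tensorflow as tf

def train_longgan(X_real, y_smooth, epochs=300, n_critic=5, batch=128):
    # Stage 1: pre-train the recurrent autoencoder with the reconstruction loss.
    autoencoder.fit(X_real, X_real, epochs=50, batch_size=batch, verbose=0)
    h_all = encoder.predict(X_real)                     # representations of real sequences

    # Stage 2: WGAN-GP on representations.
    gen, critic = build_generator(), build_critic()
    g_opt = tf.keras.optimizers.Adam(1e-4, beta_1=0.5, beta_2=0.9)
    c_opt = tf.keras.optimizers.Adam(1e-4, beta_1=0.5, beta_2=0.9)
    ds = tf.data.Dataset.from_tensor_slices((h_all, y_smooth)).shuffle(10_000).batch(batch)

    for _ in range(epochs):
        for h_real, y in ds:
            y = tf.reshape(tf.cast(y, tf.float32), [-1, 1])
            n = tf.shape(h_real)[0]
            for _ in range(n_critic):                   # several critic updates per generator update
                z = tf.random.normal([n, Z])
                with tf.GradientTape() as tape:
                    h_fake = gen([z, y], training=True)
                    c_loss = (tf.reduce_mean(critic([h_fake, y], training=True))
                              - tf.reduce_mean(critic([h_real, y], training=True))
                              + gradient_penalty(critic, h_real, h_fake, y))
                c_opt.apply_gradients(zip(tape.gradient(c_loss, critic.trainable_variables),
                                          critic.trainable_variables))
            z = tf.random.normal([n, Z])
            with tf.GradientTape() as tape:
                g_loss = -tf.reduce_mean(critic([gen([z, y], training=True), y], training=True))
            g_opt.apply_gradients(zip(tape.gradient(g_loss, gen.trainable_variables),
                                      gen.trainable_variables))
    return gen

# Sampling: decode generated representations into synthetic 72-hour sequences, e.g.
# X_synth = decoder.predict(gen.predict([noise, y_cond]))
```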

The method was implemented in Python v3.6. The random forest and logistic regression methods were implemented using the scikit-learn package [39]. The recurrent autoencoder network and the GAN network were developed using TensorFlow [40]. Other libraries used include NumPy [41], pandas [42], and SciPy [43]. Training was performed on an NVIDIA Tesla V100 (16 GB RAM).

3 Results

3.1 Evaluation of Realism

We evaluated the performance of LongGAN by training traditional machine learning models and RNNs to predict whether or not a patient will develop AKI based on the medication and laboratory results from the first 72 h of hospitalization. In our experiments we used logistic regression, random forest, and a two-layer LSTM network as examples of linear models, nonlinear models, and neural networks, respectively. In each case we trained two models, one using the real training dataset and the other using the synthetic dataset, and then evaluated both models on a real test dataset. This approach, called Train on Synthetic and Test on Real (TSTR), is a common mechanism for evaluating the realism of synthetic data [21, 22]. Since logistic regression and random forest are not designed for time-series data, we flattened the sequences along the time dimension as input for these two algorithms. We measured model performance with AUROC and AUPRC, as they are commonly used metrics for TSTR [21, 22].
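A minimal sketch of this TSTR protocol for the two non-sequential models is shown below, assuming arrays shaped (patients, 72 hours, features); the function and variable names are illustrative.

```python
# Sketch of Train-on-Synthetic, Test-on-Real (TSTR) evaluation with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score

def tstr_scores(X_train, y_train, X_test_real, y_test_real):
    flatten = lambda X: X.reshape(len(X), -1)           # flatten the time dimension
    results = {}
    for name, model in [("logistic_regression", LogisticRegression(max_iter=1000)),
                        ("random_forest", RandomForestClassifier(n_estimators=200))]:
        model.fit(flatten(X_train), y_train)            # train on synthetic (or real) data
        p = model.predict_proba(flatten(X_test_real))[:, 1]
        results[name] = {"AUROC": roc_auc_score(y_test_real, p),          # test on real data
                         "AUPRC": average_precision_score(y_test_real, p)}
    return results

# TSTR:           tstr_scores(X_synth, y_synth, X_test_real, y_test_real)
# Real baseline:  tstr_scores(X_train_real, y_train_real, X_test_real, y_test_real)
```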

We compared our method with RCGAN and TimeGAN. Since TimeGAN is not designed for conditional generation, we trained two TimeGAN models on positive cases and negative cases separately to generate synthetic data with both cases. Table 1 shows the experimental evaluation results. Our results demonstrate that models trained on synthetic datasets generated by LongGAN perform closer to models trained on real datasets than do models trained on synthetic datasets generated by RCGAN and TimeGAN. Models trained with synthetic data from LongGAN achieved up to 0.27 higher AUROC and up to 0.21 higher AUPRC values than models trained with data from RCGAN and TimeGAN.

Table 1. Performance of trained predictive models on real and synthetic datasets.

In the next set of experiments, we examined whether models trained with the synthetic dataset selected a similar set of features for prediction compared with models trained with the real dataset. To this end, we extracted the top 15 most important features of the random forest models trained with the real and synthetic datasets. Table 2 shows the list of features from each random forest model. Our experiments show that 10 features overlap between the two models.

Table 2. Top 15 most important features of random forest model trained on real/synthetic datasets.

3.2 Evaluation of Privacy Preservation

A critical requirement for a synthetic EHR data generator is that it must preserve patient privacy. In this section we evaluate this aspect of our method with respect to attribute disclosure attacks. Attribute disclosure occurs when attackers can derive target attributes about a patient based on key attributes that they already know about the patient [8, 44]. This is a prominent issue for synthetic datasets as attackers might gain sensitive knowledge of real patients based on similar records in a given synthetic dataset.

We assume the attacker has full access to the synthetic dataset and partial access to the real dataset. This is a commonly adopted setting for evaluating the attribute disclosure risk [20, 45]. More specifically, we randomly sampled 1% of patients from the real training set as the compromised records, flattened them along the time dimension, and randomly masked 10% of the attributes as the set of target attributes that are unknown to the attacker.

While there are different potential attack methods for synthetic datasets [20, 46, 47], due to space limitations we focused in this paper on KNN estimation, a common method for evaluating privacy preservation. For each compromised record, we retrieved its k-nearest neighbors in the synthetic dataset based on the key attributes and estimated each target attribute using the median of the corresponding attribute of these k neighbors. We call an estimation accurate if its relative error is below 5%. We used a dummy baseline in which the attacker simply guesses the population median, which is 0.5 since our data are quantile-transformed. This simulates the attacker’s behavior when they have no knowledge of the original dataset or the synthetic dataset and must make uninformed estimates [46].
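A sketch of this KNN estimation attack is given below; the flattened array layout, the choice of k, and the function names are assumptions, while the 5% relative-error threshold for an "accurate" estimation follows the text.

```python
# Sketch of KNN-based attribute disclosure estimation against a synthetic dataset.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_attack_accuracy(synth, compromised, key_idx, target_idx, k=5, tol=0.05):
    """Estimate masked target attributes as the median of the k nearest synthetic neighbors."""
    nn = NearestNeighbors(n_neighbors=k).fit(synth[:, key_idx])
    _, idx = nn.kneighbors(compromised[:, key_idx])     # neighbors found via key attributes
    neighbor_vals = synth[np.ix_(idx.ravel(), target_idx)].reshape(len(compromised), k, -1)
    estimates = np.median(neighbor_vals, axis=1)
    truth = compromised[:, target_idx]
    rel_err = np.abs(estimates - truth) / np.maximum(np.abs(truth), 1e-8)
    return np.mean(rel_err < tol)                       # fraction of accurate estimations

# Baseline: estimate every masked attribute as the population median (0.5 for quantile data).
```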

The idea is that a privacy-preserving synthetic dataset should avoid providing the attacker with additional knowledge for better estimation of target attributes, in order to minimize the risk of attribute disclosure. We repeated the experiment 30 times, each time randomly selecting patient records and randomly masking attributes, and computed the mean accuracy over all masked attributes. The experiments showed that with KNN estimation the attacker achieved an average mean accuracy of 20%, while with the population-median estimation the mean accuracy was 26%. A paired-samples t-test of the mean accuracies from the different experiments resulted in a p-value of 7.12e−23. This indicates that the mean accuracy of the KNN estimation was significantly smaller than that of the population-median baseline, suggesting that, in the given scenario, an attacker using KNN estimation on the synthetic dataset cannot do better than guessing without it.

4 Discussion

Generating synthetic clinical data has great potential to enable researchers to conduct competitive and reproducible research with electronic health records without privacy concerns. However, few works have tackled the problem of generating continuous time-series clinical data. We have proposed a model that combines a recurrent autoencoder and WGAN-GP to generate realistic time-series data containing continuous laboratory and medication values for given diseases. While we focused on a specific disease (AKI) in our experimental evaluation, the methodology is general and can be applied in the context of other diseases.

4.1 Comparison with Previous Work

In Esteban et al.’s work on RCGAN, RNNs (LSTMs) were used as the generator and the discriminator, and the labels were fed to the generator and the discriminator at every time step [22]. In Yoon et al.’s work on TimeGAN, RNNs were used as embedding and recovery functions to provide mappings between the feature space and a latent space, and the GAN was then trained within the latent space [23]. The GAN component of TimeGAN also used RNNs as both the generator and the discriminator. In addition, another RNN (called the supervisor in the paper) was added to enforce that the generated longitudinal data have temporal relationships similar to those of the real longitudinal data.

GANs and RNNs can both be hard to train [28, 48], and using RNN structures within a GAN can introduce additional training instability. Compared with previous studies, the key difference of our approach is that we bypass the RNN structure in the GAN. We accomplish this by taking advantage of a pre-trained recurrent autoencoder, transforming the problem of generating sequences into the problem of generating dense representations of sequences. Since the generated representations are input to the decoder, which was trained on real longitudinal data, the generated longitudinal data maintain temporal dynamics similar to those of the real dataset. Our model also differs from previous work in that we take advantage of WGAN-GP and smooth labels, which make training more stable. Moreover, our model requires minimal domain knowledge for hand-crafting features, rendering it more generalizable.

Our model achieved much better AUROC and AUPRC values than the baseline models in the predictive modeling tasks. The significant overlap of top features between models trained with synthetic data and those trained with real data suggests that LongGAN can generate realistic synthetic data, which can in turn be used to complement or replace real data for training machine learning models. The experiments on attribute disclosure demonstrated that an attacker cannot reliably obtain additional information about real patients with the help of our generated dataset, which mitigates privacy concerns.

4.2 Limitations

The datasets extracted from the Health Facts database contain many missing values, because not all patients have measurements of all medications/laboratory tests every hour. We performed imputation to obtain fixed-length longitudinal data to fit the model. However, in this process we also eliminated the patterns of missingness themselves, which can contain useful information about patients [49]. While our method generates synthetic data that are similar to the imputed data, it does not reproduce the missingness patterns present in the original datasets.

5 Conclusion and Future Work

LongGAN is a new approach to generating synthetic longitudinal EHR data. It can produce synthetic datasets that enable training of machine/deep learning models with predictive performance comparable to that of models trained with real data. For future work, we shall investigate how to combine the transformer [50] architecture with GANs and implement extensions to produce synthetic demographic data and preserve patterns of missing data. Transformer networks have achieved great success in natural language processing tasks [51, 52] and have been shown to be powerful tools for extracting useful features of sequences [53, 54]. We will also explore other aspects of privacy attacks and preservation, as well as differentially private training methods [55, 56], in order to further minimize or eliminate the risk of information leakage.