1 Introduction

The intensive care unit (ICU), and the Neurologic ICU (NICU) in particular, collects a myriad of data about its patients. Some of these data are physiologic and some are clinical. In addition, time is of the essence in detecting adverse events that arise as secondary complications. In this paper, we focus on patients with subarachnoid hemorrhage (SAH), one of the most common disease entities treated in the NICU [1, 2]. Our interest is in predicting the secondary complication of delayed cerebral ischemia (DCI) from vasospasm (VSP).

SAH is a devastating illness and a major public health burden, with an estimated 14.5 cases per 100,000 persons in the United States alone [3, 4]. Outcomes are poor even after survival of the initial aneurysm rupture, with 15% mortality and 58% functional disability, of whom 26% remain persistently dependent [5]. Additionally, as many as 20% of patients have global cognitive impairment contributing to poor functional status [6]. SAH is therefore associated with a substantial burden on health care resources, most of which relates to long-term care for functional and cognitive disability [7]. Much of the resulting functional and cognitive disability is due to DCI from VSP [7,8,9,10,11]. VSP refers to the narrowing of cerebral blood vessels after aneurysm rupture, triggered by the abnormal presence of blood surrounding the vessels. It occurs in 30% of SAH patients [12, 13] (up to 54% of SAH patients in coma [14]). At its extreme, severe VSP precludes blood flow to brain tissue, resulting in stroke. DCI is defined as the development of new focal neurological signs or a decrease of ≥ 2 points on the Glasgow Coma Scale (GCS) lasting for > 1 h, or the appearance of new infarctions on CT or MRI [15, 16]. The underlying pathophysiology is VSP, and other causes are excluded.

For a syndrome whose symptoms are subtle and whose treatment is time sensitive, accurate prediction helps clinicians remain vigilant for detection. Clinicians use a static instrument, the Modified Fisher Scale (mFS), to predict the likelihood of DCI based on the volume and pattern of blood on the initial brain computed tomography (CT) scan [17,18,19]. Resource planning and monitoring intensity are scripted around this prediction. Prevention, detection, and management of secondary complications generate a large health care burden for SAH patients [20, 21]. For higher-risk SAH patients, the first 10–14 days are occupied by efforts to detect subtle examination changes that suggest VSP (highest risk: post-bleed days 4–12) [22] and to arrange urgent interventions to prevent permanent injury. The only guideline-supported noninvasive tool for potentially identifying asymptomatic VSP is transcranial Doppler (TCD), which can have poor sensitivity and negative predictive value and is subject to technician availability and poor interrater reliability [23,24,25,26,27,28,29]. Asymptomatic VSP left unchecked may progress to symptomatic VSP, whose recognition depends on the patient's level of consciousness and on the quality and availability of expertise in the complex, diurnal environment of the NICU. Discharging patients at low risk for DCI from the ICU can result in significant cost savings [30].

Existing predictive models of DCI and VSP after spontaneous SAH are non-dynamic, and while they may help risk-stratify patients, they can lack accuracy and precision when applied to individuals. The initial head CT assessment of blood thickness and distribution has spawned three grading scales to assess the likelihood of the development of DCI [19], angiographic VSP [18], or symptomatic VSP [31]. DCI is thought to be a more meaningful outcome than symptomatic VSP, especially in patients with severe SAH, whose neurologic exam may be limited, allowing deterioration to go unrecognized. Such grading scales are based on static radiologic assessments at admission and are associated with differential odds ratios of outcome [19]. They are not precise predictors of DCI for individual patients. Efforts to improve this early prediction without additional monitoring have yielded moderate results, whether by combining risk scores [32], incorporating baseline features such as clinical condition and age [33], or assessing autoregulation [34]. Few efforts have explored time series physiological data for the early prediction of DCI.

In prior proof-of-concept work (SP) [35], a hypothesis-driven approach to angiographic VSP classification using 24–48 h summary statistics of passively collected electronic health record data (cerebrospinal fluid drainage volume, mean arterial blood pressure, heart rate (HR), intracranial pressure, sodium, and glucose) performed with a moderately favorable AUC of 0.71. The raw data used in that study were low frequency (hourly at best), and the extracted features were summarized over 24 or 48 h. This result was encouraging evidence that EMR and physiologic data could support risk stratification for future events. The question remains whether greater precision can be achieved with higher frequency data.

There is an extensive literature on robust feature extraction from physiological time series data for outcome prediction. Approaches can be broadly classified as either hypothesis driven or data driven. Hypothesis-driven approaches have focused primarily on temporal data abstraction that relies on knowledge-based symbolic representations of clinical states, whether by a priori threshold setting or interval changes [36, 37], summary statistics [35, 38,39,40], or template matching [41]. Hypothesis-driven feature extraction can be effective in prediction but requires domain expertise in designing metafeatures and may introduce bias [38]. Data-driven or learning approaches extract meaningful features directly from the labeled data without an a priori hypothesis [42,43,44,45,46,47,48,49,50,51]. The data-driven approach of featurization via random kernels [52, 53] has shown promise in the field of image classification [54]. Random kernels, when convolved with unknown images, extract features that are frequency selective and translation invariant, characteristics that are also desirable when processing temporal physiologic data. In our approach, we apply random kernels to extract features from high frequency temporal physiologic data that maximally classify DCI.

2 Patients and methods

2.1 Study population

Consecutive patients with spontaneous SAH admitted to the Columbia University Medical Center NICU between August 1996 and December 2014 were prospectively enrolled in an observational cohort study of SAH patients designed to identify novel risk factors for secondary injury and poor outcome. The study was approved by the medical center Institutional Review Board. In all cases, written informed consent was obtained from the patient or a surrogate. Patients with SAH secondary to perimesencephalic bleeds, trauma, or AVM, and patients < 18 years old, were not enrolled in the study. Starting in 2006, physiologic data was acquired using a high-resolution acquisition system (BedmasterEX; Excel Medical Electronics Inc, Jupiter, FL, USA) from General Electric Solar 8000i monitors (Port Washington, NY, USA; 2006–2013) or Philips Intellivue MP70 monitors (Amsterdam, The Netherlands; 2013–2014) at 0.2 Hz.

Exclusion criteria for this project were the following: (1) absence of physiologic monitoring data (before 2006), (2) VSP or DCI before post-bleed day (PBD) 3, and (3) missing all candidate features. The targeted classification outcome was DCI, defined as the development of new focal neurologic signs or deterioration of consciousness for > 1 h, or the appearance of new infarctions on imaging, due to VSP [16]. This was adjudicated by consensus among the treating neurointensivists during a weekly meeting, as part of the observational cohort study.

2.2 Data analysis

Data analysis and model building were performed using custom software developed in Matlab 2016a (Mathworks, Natick, MA) and Python (http://www.python.org). Figure 1 shows a flowchart of the data processing and analysis. The input to the model was physiologic data sampled at 0.2 Hz and limited to PBD 0–3. The target classification outcome was DCI (beyond PBD 3).

Fig. 1
figure 1

Overview of the approach

2.2.1 Baseline candidate features and outcomes

The following baseline characteristics, grading scales, and outcomes were prospectively recorded at admission: age, sex, worst Hunt-Hess grade in the first 24 h (HH), mFS, admission GCS, length of stay, timing of DCI, mortality, and Modified Rankin Scale (MRS). HH grade was dichotomized into low grade (1–3) and high grade (4–5). mFS was dichotomized into low grade (0–2) and high grade (3–4). MRS was dichotomized into good outcome (0–3) and poor outcome (4–6). Baseline features and outcomes were compared for patients with DCI versus no DCI. Baseline features were also compared for the derivation versus validation datasets.
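For illustration only (not the study code), these dichotomizations could be implemented as follows, assuming a pandas DataFrame with hypothetical column names 'HH', 'mFS', and 'MRS':

```python
import pandas as pd

def dichotomize_scales(df: pd.DataFrame) -> pd.DataFrame:
    """Dichotomize grading scales as described above (column names are hypothetical)."""
    out = df.copy()
    out["HH_high"] = (out["HH"] >= 4).astype(int)    # Hunt-Hess 4-5 = high grade
    out["mFS_high"] = (out["mFS"] >= 3).astype(int)  # modified Fisher 3-4 = high grade
    out["MRS_poor"] = (out["MRS"] >= 4).astype(int)  # modified Rankin 4-6 = poor outcome
    return out
```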

Frequency comparisons for categorical variables were performed by Fisher exact test. Two-group comparisons of continuous variables were performed with the Mann–Whitney U test. All statistical tests were two-tailed and a p-value < 0.05 was considered statistically significant.
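These two-group comparisons can be sketched with SciPy; the counts and values below are synthetic and purely illustrative:

```python
import numpy as np
from scipy.stats import fisher_exact, mannwhitneyu

# Categorical variable (e.g., sex): 2x2 table of counts (rows: DCI / no DCI; illustrative only)
table = np.array([[77, 17],
                  [256, 138]])
_, p_sex = fisher_exact(table)

# Continuous variable (e.g., age): two-sided Mann-Whitney U test on synthetic values
rng = np.random.default_rng(0)
age_dci, age_no_dci = rng.normal(55, 12, 94), rng.normal(57, 13, 394)
_, p_age = mannwhitneyu(age_dci, age_no_dci, alternative="two-sided")
print(f"sex: p = {p_sex:.4f}   age: p = {p_age:.4f}")
```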

2.2.2 Physiologic feature extraction using random kernels

While 0.2 Hz physiological data was available, we remained agnostic about the optimal scale or sampling rate for DCI classification. Five universally available ICU variables [HR, respiratory rate (RR), systolic blood pressure (SBP), diastolic blood pressure (DBP), and oxygen saturation (O2)] were downsampled (ds) from 0.2 Hz to periods of 1, 5, 10, 20, and 60 min, 2 h, and 4 h. Downsampling used medians, which mitigated the effect of erroneous data [55]. Data was truncated to PBD 0–3 and zero-padding was performed for missing data as a pre-processing step. We selected random kernels [52,53,54] to be applied and assessed for maximal convolution as shown in Fig. 2. In particular, we first generated a random kernel (filter) \(h\) of size \(k\) by sampling values from the normal distribution \(N(0,1)\). Next, given a time series \(X \in R^{1 \times t}\) and a convolving filter \(h \in R^{1 \times k}\), where \(k \le t\), the feature \(f\) for series \(X\) and filter \(h\) is given by \(\max (X*h)\), where \(*\) denotes the valid convolution. Valid convolution means that \(h\) is applied only at positions of \(X\) such that \(h\) lies entirely within \(X\). In practice, we performed convolutions only when the contiguous data length was at least twice the length of the kernel.
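To make the featurization concrete, the following is a minimal sketch (not the study code) of the median downsampling, zero-padding, and max-of-valid-convolution steps described above, applied to a synthetic heart rate series:

```python
import numpy as np

def median_downsample(x, factor):
    """Downsample by taking the median of each non-overlapping window of `factor` samples."""
    n = (len(x) // factor) * factor
    return np.nanmedian(x[:n].reshape(-1, factor), axis=1)

def random_kernel_feature(x, kernel):
    """Maximum of the valid convolution of series x with a random kernel (gaps zero-padded)."""
    x = np.nan_to_num(x, nan=0.0)                        # zero-pad missing samples
    if len(x) < len(kernel):                             # simplified contiguity check
        return 0.0
    return np.convolve(x, kernel, mode="valid").max()

rng = np.random.default_rng(42)
hr_0p2hz = rng.normal(80, 5, size=4 * 24 * 60 * 12)     # synthetic 4-day HR series at 0.2 Hz
hr_1min = median_downsample(hr_0p2hz, factor=12)        # 0.2 Hz (12 samples/min) -> 1-min medians
kernel = rng.standard_normal(10)                        # one random kernel of length kl = 10
feature = random_kernel_feature(hr_1min, kernel)
```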

Fig. 2
figure 2

Feature extraction from physiologic time series data. 20 randomly chosen kernels were applied and assessed for maximal convolution, for each varying kernel length (kl; 2, 5, 10, 20) and for each downsampling period (ds; 1, 5, 10, 20, 60, 120, 240) and for each of five variables (var; HR, RR, SBP, DBP, O2). This resulted in a convolution matrix of 2800 candidate features

We selected 20 random kernels for each of the five variables (var; HR, RR, SBP, DBP, O2), for each kernel length (kl; 2, 5, 10, 20), and for each downsampling period (ds; 1, 5, 10, 20, 60, 120, 240 min) as well as for the native 5 s data. For the 5 s data, we used larger kernel lengths (kl; 20, 60, 120, 180) on the assumption that smaller kernel lengths (spanning 10–100 s in total) would yield clinically irrelevant features and cause the models to over-fit the dataset. This resulted in 3200 candidate random kernel derived physiological features. As the number of kernels increases, the computational cost of generating the random features increases. Increasing the number of kernels per ds/kl/var combination beyond 20 did not affect our analysis; 20 was therefore chosen to maximize performance while minimizing computational complexity.
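Reusing the helper functions from the previous sketch, the full candidate feature matrix could be assembled roughly as follows (again illustrative, not the study code; for brevity a single kernel bank is shared across variables and scales, whereas the study drew 20 kernels per var/kl/ds combination):

```python
import numpy as np

VARS = ["HR", "RR", "SBP", "DBP", "O2"]
DS_MIN = [1, 5, 10, 20, 60, 120, 240]        # downsampling periods (minutes)
KL_DS = [2, 5, 10, 20]                       # kernel lengths for downsampled series
KL_5S = [20, 60, 120, 180]                   # kernel lengths for native 5 s data
N_KERNELS = 20

rng = np.random.default_rng(0)
KERNELS_5S = {kl: [rng.standard_normal(kl) for _ in range(N_KERNELS)] for kl in KL_5S}
KERNELS_DS = {kl: [rng.standard_normal(kl) for _ in range(N_KERNELS)] for kl in KL_DS}

def featurize_patient(signals_0p2hz):
    """signals_0p2hz: dict mapping variable name -> 0.2 Hz array for PBD 0-3."""
    feats = []
    for var in VARS:
        raw = signals_0p2hz[var]
        for kl in KL_5S:                                  # native 5 s resolution
            feats += [random_kernel_feature(raw, h) for h in KERNELS_5S[kl]]
        for ds in DS_MIN:                                 # median-downsampled series
            x = median_downsample(raw, factor=ds * 12)    # 12 samples per minute at 0.2 Hz
            for kl in KL_DS:
                feats += [random_kernel_feature(x, h) for h in KERNELS_DS[kl]]
    return np.array(feats)                                # 5 * (4 + 7 * 4) * 20 = 3200 features
```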

Physiologic data was limited to the first 4 days after aneurysm rupture to limit the influence of clinical treatment in response to suspected VSP or DCI [22].

2.2.3 Feature selection and model building

Minimal Redundancy Maximal Relevance (mRMR) [56,57,58] was applied to identify the most relevant features for classification. mRMR selects the features that maximize the mutual information between the features and the target class while minimizing the mutual information among the features. The features are ranked by a greedy search that maximizes the Mutual Information Difference criterion (MID) or the Mutual Information Quotient criterion (MIQ). Let \(S = \{x_1, \ldots, x_n\}\) be the set of features and \(h\) be the target class (in our case, DCI vs. non-DCI); then the features are ranked as

$$\text{MID}:\ \max_{x_i \in S}\left[ I\left(x_i, h\right) - \frac{1}{\left|S\right|}\sum_{x_j \in S} I\left(x_i, x_j\right) \right],$$
$$\text{MIQ}:\ \max_{x_i \in S}\left[ \frac{I\left(x_i, h\right)}{\frac{1}{\left|S\right|}\sum_{x_j \in S} I\left(x_i, x_j\right)} \right],$$

where \(I(x_i, h)\) is the mutual information (information gain) between the feature \(x_i\) and the target class \(h\). The first \(k\) ranked features are then used to learn the classifier. This simplifies the model, reduces training time, and enhances the generalizability of the classification model.
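A hedged sketch of greedy mRMR ranking under the MID criterion follows; it uses scikit-learn's mutual information estimators, whereas the study used the mRMR implementations cited above, and the function name is illustrative:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_mid(X, y, k):
    """Greedily rank k features by relevance I(x_i, y) minus mean redundancy I(x_i, x_j)."""
    n = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=0)   # I(x_i, h)
    redundancy = np.zeros((n, n))                           # filled lazily as features are picked
    selected, remaining = [], list(range(n))
    while len(selected) < k and remaining:
        if not selected:
            best = remaining[int(np.argmax(relevance[remaining]))]
        else:
            scores = [relevance[i] - np.mean([redundancy[i, j] for j in selected])
                      for i in remaining]                    # MID criterion
            best = remaining[int(np.argmax(scores))]
        # redundancy of the newly selected feature against all features
        redundancy[:, best] = mutual_info_regression(X, X[:, best], random_state=0)
        redundancy[best, :] = redundancy[:, best]
        selected.append(best)
        remaining.remove(best)
    return selected
```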

We used mRMR in combination with linear and kernel-based Support Vector Machine (SVM-L and SVM-K) classifiers [59, 60], as well as Partial Least Squares (PLS) regression [61], for combined feature selection and classification. The mRMR feature selection criterion identified the top 1600 (50%) features from the 3200 physiologic features and 5 baseline demographic/grading-scale features. These features were then used by the classifiers to learn the model. PLS regression first performs a principal component analysis on all feature vectors and then applies a least squares regression using the components that explain the most variance. Weighted SVM was used to account for class imbalance (i.e., fewer DCI than non-DCI cases in any consecutive SAH dataset).

All classifiers were trained for the binary outcome of DCI presence/absence. We trained our models on single variables as well as on collections of variables to identify the combination of features that was most informative (i.e., maximally performing) for classifying DCI.
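As an illustration of the PLS-based classification (not the study code; the data and the number of latent components are synthetic placeholders), the binary DCI label can be regressed on the selected features and the continuous PLS prediction used as the classification score:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(390, 200)), rng.integers(0, 2, 390)   # synthetic data
X_test, y_test = rng.normal(size=(98, 200)), rng.integers(0, 2, 98)

pls = PLSRegression(n_components=5)           # the number of latent components is a modeling choice
pls.fit(X_train, y_train.astype(float))
scores = pls.predict(X_test).ravel()          # continuous scores, ranked or thresholded for DCI
print("AUC:", roc_auc_score(y_test, scores))
```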

2.2.3.1 Weighted Support Vector Machines and imbalanced class sizes

Given a training set \(\{(x_1, y_1), \ldots, (x_N, y_N)\}\), where \(x_i \in R^d\) and \(y_i \in \{-1, 1\}\), the SVM problem can be formulated as

$$\min_{w,\,\epsilon,\,b}\ \left\{ \frac{1}{2}\left\| w \right\|^2 + \delta_{ic}\, C \sum_{i=1}^{n} \epsilon_i \right\}$$
$$\text{subject to}\quad y_i\left(w \cdot x_i - b\right) \ge 1 - \epsilon_i; \qquad \epsilon_i \ge 0,$$

where \(w \cdot x - b = 0\) defines the hyperplane that separates the two classes, and \(\epsilon_i = \max(0,\, 1 - y_i(w \cdot x_i - b))\) is the slack variable (a means of relaxing the constraint by allowing points for which it fails). \(C\) is the tradeoff parameter that controls the slack variables; a small \(C\) allows constraints to be easily ignored, whereas a large \(C\) makes constraints hard to ignore. For the traditional SVM, \(\delta_{ic} = 1\), meaning the optimization penalizes all data points equally. With imbalanced class sizes, this biases the classifier in favor of the class with the larger sample size. To overcome that issue, we added a penalty term \(\delta_{ic}\) set according to the total number of samples in each class, with higher values for the smaller class. We used LibSVM [62] for the SVM and weighted SVM classification.
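A minimal class-weighted linear SVM analogous to this formulation can be sketched with scikit-learn (whose SVC is built on LibSVM); here `class_weight="balanced"` plays the role of the per-class penalty \(\delta_{ic}\), and the data are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_train = rng.normal(size=(390, 50))
y_train = (rng.random(390) < 0.2).astype(int)        # ~20% positive class, as in this cohort

# class_weight="balanced" sets the per-class penalty inversely proportional to class frequency
svm_l = SVC(kernel="linear", C=1.0, class_weight="balanced")
svm_l.fit(X_train, y_train)
decision_scores = svm_l.decision_function(X_train)   # signed distance from the separating hyperplane
```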

2.3 Internal validation and validation strategy

The cohort was randomly split 80/20% while maintaining the proportion of the targeted outcome (DCI). The 80% portion was used to train and test models and was considered the primary derivation dataset. For internal validation of our models, we performed cross-validation on the derivation data with a 12.5% hold-out set; the hold-out set preserved the proportion of the targeted outcome in the training data. Discriminative performance is described by the area under the receiver operating characteristic curve (AUC); the median AUC over 100 runs is reported. The remaining 20% of the cohort was not involved in model training and was used exclusively for testing the classification accuracy of our models. Classification accuracy on the validation test set is reported as AUC with 95% confidence intervals (CI). An overview of the analytical approach is illustrated in Fig. 1.
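The split and repeated hold-out scheme can be sketched as follows (synthetic data and an arbitrary classifier are used purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(size=(488, 100))
y = (rng.random(488) < 0.19).astype(int)          # ~19% DCI, as in this cohort

# stratified 80/20 split into derivation and validation sets
X_der, X_val, y_der, y_val = train_test_split(X, y, test_size=0.20, stratify=y, random_state=0)

# 100 repetitions of a stratified 12.5% hold-out within the derivation set
aucs = []
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.125, random_state=0)
for train_idx, hold_idx in splitter.split(X_der, y_der):
    clf = SVC(kernel="linear", class_weight="balanced").fit(X_der[train_idx], y_der[train_idx])
    aucs.append(roc_auc_score(y_der[hold_idx], clf.decision_function(X_der[hold_idx])))
print("median cross-validated AUC:", np.median(aucs))

# final evaluation on the untouched 20% validation set
final = SVC(kernel="linear", class_weight="balanced").fit(X_der, y_der)
print("validation AUC:", roc_auc_score(y_val, final.decision_function(X_val)))
```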

3 Results

From August 1996 to December 2014, 1595 SAH patients were enrolled in the observational cohort study (SHOP). Of these, 562 SAH patients with physiologic data were available from May 2006 to December 2014; 8 had VSP or DCI identified before PBD 3 and 66 were missing all candidate features, leaving a total of 488 subjects included in the study (Fig. 1).

Table 1 displays the baseline features, grading scales, and outcomes of subjects with and without DCI. DCI was found in 94 subjects (19.3%) in the entire cohort: 75 (19.2%) in the derivation set and 19 (19.4%) in the test set. Patients with DCI were more frequently women (82 vs. 65%, p = 0.001). None of the grading scales differed significantly between the two groups (HH p = 0.27; mFS p = 0.69; GCS p = 0.16). Length of stay was significantly longer in the DCI group (18.8 ± 6.6 vs. 9.4 ± 7.1 days, p < 0.0001). When DCI occurred, it did so at a mean of day 7.1 (± 2.6 days). Absence of DCI was associated with significantly higher mortality (19.8 vs. 8.5%, p = 0.0098); notably, 76.9% of the deaths in the no-DCI group occurred before the mean day of DCI onset (7.1). MRS at 3 months favored good outcome for patients without DCI (p = 0.0236).

Table 1 Baseline features, grading scales, and outcomes of subjects with and without DCI

Table 2 displays the baseline features, grading scales, and outcomes of the derivation and validation groups, which were randomly selected while maintaining the proportion of the targeted outcome. No significant differences were found between the two groups.

Table 2 Baseline features, grading scales, and outcomes of the derivation and validation groups

3.1 Model performance

The median AUC of 100 runs of cross-validation (with a 12.5% hold-out set) is presented in Table 3. Among the baseline characteristics, sex (AUC 0.59) performed slightly better than age (AUC 0.57, PLS). HH (AUC 0.58, PLS) and GCS (AUC 0.58, SVM-K) achieved better accuracy for DCI prediction than mFS (AUC 0.53, SVM-L). Combining baseline characteristics and grading scales (age, sex, HH, mFS, GCS), a PLS classifier performed better than the individual features, with an AUC of 0.64. A combination of all random kernel derived physiological features achieved an AUC of 0.71 with a PLS classifier and 0.72 with an SVM-L classifier. Adding baseline characteristics and grading scales did not improve performance.

Table 3 Model performance in derivation and validation datasets

Feature reduction with mRMR achieved the best classification performance, with an AUC of 0.77 (PLS). In the PLS classifier, the weights indicate the discriminative power of the features in separating the two classes. Figure 3a shows the PLS weights of these features; Fig. 3b shows the kernels corresponding to a demonstrative selection of the top ten features selected by the PLS model. The kernel associated with the O2 feature with downsampling (DS) 1 and kernel length (KL) 120 was given the highest weight among the 1600 features selected by mRMR. The selected kernels display the time-varying characteristics of the different variables and highlight the need to capture high frequency data at different scales (downsampling rates).

Fig. 3
figure 3

Features selected by Minimum Redundancy–Maximum Relevance (mRMR) and classification by partial least squares (PLS). a A very large candidate feature set was reduced by mRMR, selecting the top 50% of least redundant and most informative features. A PLS classifier was trained on these features; the weights are shown. b For demonstration, the kernels for the top 10 weighted features selected by PLS are visualized, showing the time-varying characteristics captured by the random kernels. Kernel length is represented on the x-axis

The classification accuracy of the derived models on the unseen validation test set was similar to the estimated performance of the cross-validated models. ROC curves are shown in Fig. 4. A model based on the traditional grading scale (mFS) achieved an AUC of 0.62 (SVM-L, 95% CI 0.5–0.74). Adding demographics and other baseline scales did not improve prediction (AUC 0.63, 95% CI 0.51–0.75, SVM-L). Adding random kernel derived physiologic features improved prediction (AUC 0.75, 95% CI 0.5–0.91, SVM-L), but performed essentially the same as random kernel derived physiologic features alone (AUC 0.74, 95% CI 0.52–0.92, PLS). This combined physiologic model (HR, SBP, DBP, RR, and O2 sat) was more predictive than any single-variable model (AUC range 0.47–0.70). Feature reduction (minimizing redundancy and maximizing relevance) applied to the combined baseline, grading scale, and physiologic data produced the best classification performance, with an AUC of 0.77 (95% CI 0.55–0.94, SVM-L).
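The method for the 95% CIs is not detailed here; one common approach is a percentile bootstrap over the validation set, sketched below on synthetic scores for illustration only:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
y_val = (rng.random(98) < 0.19).astype(int)        # synthetic validation labels (~19% DCI)
scores = y_val * 0.5 + rng.normal(size=98)         # synthetic classifier scores

boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_val), len(y_val))  # resample with replacement
    if y_val[idx].min() == y_val[idx].max():       # skip resamples missing a class
        continue
    boot_aucs.append(roc_auc_score(y_val[idx], scores[idx]))

lo, hi = np.percentile(boot_aucs, [2.5, 97.5])
print(f"AUC {roc_auc_score(y_val, scores):.2f} (95% CI {lo:.2f}-{hi:.2f})")
```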

Fig. 4
figure 4

ROC curves of model performance on the validation dataset

4 Discussion

Recognizing trends and patterns and minutely analyzing complex data require the layered knowledge of clinical experts but defy rule-based systems. In prior work (SP) using summary statistics of 24 h (lower frequency) data, a Naïve Bayes classifier for angiographic VSP outperformed TCD and clinical examinations, generating an AUC of 0.71 [35]. Here, we show that features extracted from higher frequency temporal data (ranging from 1 min to 4 h) may be superior to lower frequency data in classifying outcome after SAH.

In our approach, we extracted high-level features from existing physiologic data, without an a priori hypothesis of what patterns might emerge. To enable validation efforts and generalizability to other datasets and institutions, we focused on universal physiologic ICU variables and the typical baseline SAH grading scales used in the NICU. In this translational work, we used a random featurization method to extract frequency selective and translation invariant characteristics of time series data. The novel application to time series data required some choices bound by characteristics of the dataset (kernel lengths) and domain (downsampling rates). We tested our method for its ability to discriminate DCI and found that random kernel derived physiologic features outperformed current static grading scales. When combined with grading scales and demographics, our random kernel derived physiologic features predicted DCI with an AUC (0.77, PLS) approaching clinical reliability (threshold of 0.8 [63]). PLS and SVM-L models performed equally well, indicating that our feature extraction method provides sufficient discriminative ability.

To demonstrate the robustness of the model, we used an internal validation strategy, testing on a separate dataset excluded entirely from model building. Generalizability of a machine learning algorithm, however, assumes that the training dataset is large and diverse enough to be representative. A limitation of this study is its single-center design; there is no publicly available SAH dataset with a similar granularity of physiologic data. Future efforts will include developing complementary SAH cohorts and validating these algorithms.

5 Conclusions

A random kernel featurization and learning approach applied to physiological time series data before the peak DCI period shows promise for improving prediction precision. It is a computationally inexpensive and agnostic feature extraction approach for physiologic time series parameters in the ICU (HR, RR, SBP, DBP, O2 sat). There is a vast pool of candidate features within the EMR with a biological basis for classification ability (i.e., drawn from frequentist statistical studies showing relationships with VSP and DCI in specific SAH cohorts). Future efforts will also draw from this feature pool to further improve the precision of DCI prediction, favoring candidate features that are obtained during standard clinical care and are thus potentially automatable.