Introduction

In neurological intensive care units (NICUs) we are collecting an ever increasing quantity of data. These range from patient demographics and physiological monitoring to treatment strategies and outcomes. The BrainIT database is an example of this type of rich data source. It contains validated data on 264 patients who suffered traumatic brain injury (TBI) admitted to 22 NICUs in 11 European countries between March 2003 and July 2005 [1, 6].

The application of data mining techniques may allow us to identify patterns or relationships in clinical databases and lead to new insights that improve patient care [9]. Cluster analysis is a form of data mining that allows researchers to analyse a data set in its entirety and provides the opportunity to identify patterns within the data [4]. The aim of cluster analysis, unlike simply categorising data, is to identify linked groups based on multivariate factors. Reasons for the formation of these groups can then be hypothesised and tested in a clinically appropriate scientific manner.

Cluster analysis has previously been applied to high-frequency physiological data collected from adult patients following TBI or polytrauma and from paediatric patients following TBI [2, 8, 10]. It has been possible to identify complex patient states that may be used to predict patient outcome [10]. However, these previous studies have been restricted to small numbers of patients and focussed only on continuous variables. The BrainIT database contains data on a much larger number of patients and the data are continuous, ordinal and categorical. We therefore describe a pilot study using cluster analysis to identify distinct groups of patients within the BrainIT database.

Materials and Methods

Data Processing

The BrainIT database is organised into nine large data tables: demographic and one-off clinical data; daily management data; laboratory data; event data; surgical procedures; monitoring data summary; neurological event summary; targeted therapies; vital monitoring data. For the purposes of this pilot study we decided to focus on the demographic and one-off clinical data. As mentioned above, this would mean dealing with a mixture of continuous, ordinal and categorical data. Also, the quantity of data involved compared with the vital monitoring was far less computationally demanding. The table contains 109 variables including age, gender, hospital admission and transfer times, diagnosis, admission physiological variables and Glasgow Coma Scale (GCS) score, admission laboratory results and Glasgow Outcome Scale (GOS) score.

The data table was processed to optimise its clinical relevance and suitability for cluster analysis. For example, dates and times were converted to “time to” values, with time of trauma being time zero. Any negative or unrealistically large values identified were examined and manually corrected if an obvious date/time input error was present, or else made blank. Additionally, multiple units used for hydrogen ion, haemoglobin and glucose concentrations were converted to a standard. An appendix documenting all of the alterations made to the data table and their rationale will be made available with the BrainIT database.

We further processed the data table to create an increased data density for cluster analysis. To achieve this, we limited analysis to variables with at least 70 % data entered and patients with at least 65 % of these variables measured. Ultimately, we created two final tables, with one containing a combination of categorical, ordinal and continuous data and another containing continuous data only.

Cluster Analysis

Gower’s dissimilarity metric was calculated to determine the difference between patients based on each of the measured variables [3]. It is regarded as the most appropriate measure when performing analysis on mixed data types. The resulting metric was then used to perform agglomerative hierarchical cluster analysis. This is illustrated in Fig. 1.

Fig. 1
figure 1

Illustration of the stages of agglomerative hierarchical cluster analysis. Each patient is initially in their own cluster. Clusters are successively merged according to the degree of dissimilarity. Finally, all patients are included in a single cluster

All processing of data tables and cluster analysis was performed in r studio version 0.95.258 with r statistics version 2.13.1. The Gower dissimilarity metric was calculated using the gower.dist function and agglomerative hierarchical cluster analysis using the function hclust.

Results

Data Processing

Processing of the demographic and one-off clinical data table resulted in the creation of one data table containing 42 variables for 251 patients and another containing 10 continuous variables for 203 patients (Fig. 2).

Fig. 2
figure 2

Data processing stages leading to the creation of two data tables for cluster analysis

Cluster Analysis

Agglomerative hierarchical clustering of the combined data revealed five clusters selected based on visual inspection of the resulting dendrogram (Fig. 3). Cluster B contained most patients (160), clusters A, C and D were of similar size (33, 22 and 33) and cluster E was an outlier with only three patients (Table 1). The features of clusters A and D were the most clinically interesting. Patients in cluster A tended to be older and were most likely to have: consumed alcohol; fallen or been assaulted; previous dysfunction; mass lesion on CT (computed tomography). Despite having reasonable GCS motor scores on admission they had the highest mortality. Patients in cluster D tended to be younger and were most likely to have: been in a road traffic accident; associated multi-trauma; hypoxia and hypotension on admission; lowest GCS motor score; dilated and non-reactive pupils; lowest GOS.

Fig. 3
figure 3

Cluster dendrogram output through agglomerative hierarchical cluster analysis of the combined data table. The division of the dendrogram to produce five clusters is demonstrated

Table 1 Selected group characteristics from the cluster analysis of all available data

Hierarchical cluster analysis of the continuous data table revealed three clusters based on visual inspection of the dendrogram. Cluster A contained most of the patients (175), whereas clusters B and C were of equal size (14; Table 2). Cluster C seems to have the most unique features with a tendency to lower SaO2 and PaO2 values, associated with increased transfer time from the pre-neurosurgical hospital (PNSH) to the neurosurgical hospital (NSH).

Table 2 Group characteristics from the cluster analysis of continuous data

Discussion

This is a pilot project of cluster analysis of the BrainIT database and is the first analysis of its type in this rich data source. However, results produced through cluster analysis must be interpreted with caution. The clusters formed are dependent upon the dissimilarity metric and the method of cluster analysis chosen [4]. In this study, we used the Gower dissimilarity metric owing to the inclusion of categorical, ordinal and continuous data. Similar to previous studies in this domain, we elected to perform agglomerative hierarchical cluster analysis [2, 8, 10].

With the above caution in mind, we believe that our analyses have identified clusters of patients that are physiologically and clinically sensible. In analysing the table containing combined data types, we have identified two clinically relevant clusters of patients. The first is an older group, who have a higher rate of falls and are more likely to have mass lesions on CT. The second is a younger group, who have suffered multi-trauma and have markers of severe injuries on admission with more hypoxia, hypotension, pupil abnormalities and lower GCS scores. As would be expected from existing predictive models [5, 7], these patients tended to have poorer outcomes as assessed by the GOS.

In addition, analysis of the table containing only continuous data revealed a relevant cluster of patients who tended to have lower SaO2 and PaO2 values. It can be reasoned that this physiological derangement led to them also having longer transfer times from the PNSH to the NSH.

This project has successfully demonstrated the feasibility of using cluster analysis in the exploration of a large, mixed database of TBI patients. The unsupervised identification of clinically recognisable groups of patients supports the validity of the results presented. We now plan to apply similar analyses to other tables in the BrainIT database that contain high-frequency physiological data including blood pressure and intracranial pressure. The ultimate aim is to identify demographically and physiologically distinct groups of patients who will be amenable to specific treatment strategies.