Introduction

The Full Outline of Unresponsiveness (FOUR) Score is a useful clinical instrument for the assessment of the level of consciousness [1]. It includes four elements (eye and motor response, brainstem reflexes and respiratory pattern), and is designed to overcome some of the downsides [2, 3, 4, 5, 6, 7, 8] of the Glasgow Coma Scale (GCS) [9, 10].

The FOUR Score provides more neurological detail than the GCS: it assesses brainstem reflexes and respiratory pattern, adds an eye-tracking test to the traditional eye-opening response, replaces the verbal element with hand-gesturing, and includes myoclonic status epilepticus in the motor assessment [1, 11]. These features are potential advantages of the FOUR Score, especially in critically ill and intubated patients, in whom the GCS cannot be fully applied (the verbal element is frequently pseudoscored), while the additional brainstem information (reflexes and respiration) can be used to better sub-categorize patients with a total GCS score of 3 [1, 11]. Moreover, the European Academy of Neurology has recently recommended its use over the GCS for the assessment of patients with disorders of consciousness in the acute setting [12].

The FOUR Score has been validated in many languages, most recently in Greek [13]. Its clinimetric properties, including the prognostic value, have also been investigated [11, 14]. However, to our knowledge, even though the level of agreement between different raters has previously been assessed [11, 13, 14], studies investigating the impact of the examiner’s education and experience on the FOUR Score’s predictive value are lacking.

This study aimed primarily (1) to estimate the predictive value of the Greek versions of the FOUR Score and the GCS, (2) to test the hypothesis that the FOUR Score predicts patients’ short- and long-term outcome comparably to the GCS, and (3) to test the hypothesis that the scales’ predictive ability remains unaffected when applied by raters with different health-care backgrounds and experience. The secondary aim was to estimate the value of the scales in identifying the comatose state.

Methods

Study design and setting

This prospective observational study followed the STROBE cohort reporting guidelines [15]. Participant recruitment took place between October 1st, 2018, and December 31st, 2020, in the Department of Neurosurgery at Hippokration General Hospital, Thessaloniki, Greece, a 24-bed unit. The unit also treats patients who need closer monitoring and intensive care. Patients with neurosurgical pathologies treated in the intensive care unit (ICU) of the hospital were also included.

Participants were closely followed during their hospitalization. Clinical and radiological data were collected. Level of consciousness was recorded on admission, on clinical deterioration and at discharge. Outcome was assessed at discharge and at 6 months.

The study protocol had been previously agreed upon by all authors and approved by the Ethics Committee of the hospital (Ref. Nr. 985-2017). The National Data Protection Authority was notified of the conduct of the study (Ref. Nr. 850-2018), and legal consent was obtained from all patients capable of providing it, or by proxy when deemed necessary. The ethical standards set by the 1964 Helsinki Declaration and its later amendments were followed.

Variables and data collection

The level of consciousness was assessed with the FOUR Score and the GCS on admission, within 12 h from presentation, by four raters: author DMA (a Consultant Neurosurgeon who was considered the main researcher and whose assessments were used as reference for comparison), author PPT (a Consultant Neurosurgeon and Associate Professor of Neurosurgery), a resident (6 in total, equally distributed across all years of training) and a nurse (8 in total, all of whom had graduated from accredited nursing schools and had at least 10 years of experience). The inter-rater reliability of the Greek version of both scales has previously been evaluated, and excellent agreement was found for all pairs of raters [13].

The examiners were blinded to each other’s ratings, and the ratings were performed independently within a maximum of 1 h. A full rating session had to be repeated when a patient’s deterioration was attributed to neurosurgical pathology, and the updated assessment was used for calculations. Raters were previously informed of the methodology and the objectives of the study and received rigorous training on the application of the scales by the two certified Neurosurgeons [13]. Furthermore, they were given written instructions for the application of the two scales and scoring sheets, which they were asked to use during their assessments. Participants were assessed anew by the main researcher with both scales at discharge.

Outcome was recorded using the modified Rankin Scale (mRS) [16, 17] and the Glasgow Outcome Scale—Extended (GOSE) [18, 19] both at discharge and at 6 months, by the main researcher. For the evaluation of long-term outcome at 6 months, patients were reached by phone, a practice commonly used in similar studies [20, 21]. For the secondary aim of the study, comatose patients were classified as such by the main researcher based on the standard definition of coma [22]. To avoid bias, these assessments were blinded to any consciousness assessments, and preceded the main researcher’s own ratings.

Results from the level of consciousness and outcome assessments were collected directly during hospitalization. Additional data regarding the patients’ characteristics and their clinical course were extracted during hospitalization and at discharge. Data were completely anonymized and digitally documented in a Microsoft Excel© 2019 (Microsoft Corporation, Redmond, Washington, USA) spreadsheet. The procedure was in full compliance with the current legislation.

Eligibility criteria

Patients meeting the following criteria were enrolled: (1) age ≥ 18 years, (2) main indication for clinical care resulting from pathologies that required neurosurgical care, constant neurosurgical assessments and possible intervention, (3) hospitalization in the Neurosurgical Department and/or the ICU; patients transferred to or from those departments during their hospitalization were also included, and (4) impaired level of consciousness, or initially normal responsiveness but a need for consciousness monitoring due to risk of deterioration.

Subjects that presented with one or more of the following were excluded from the study cohort: (1) denial, withdrawal or inability to acquire legal consent, (2) unavailability of all examiners to obtain a reliable and blinded patient assessment within 1 h, (3) inability of trained examiners to obtain complete patient assessment within 12 h from presentation, (4) incomplete assessment of the patient’s worst level of consciousness, (5) inability to assess the patient’s responsiveness at discharge, (6) failure to evaluate the patient’s outcome at discharge and/or at 6 months, (7) dementia, mental illness, sedatives, neuromuscular junction blockers, alcohol or addictive substances, in case the reliability of the assessments could be jeopardized, and (8) cases with any form of missing data. ICU patients were only evaluated after the administration of the aforementioned agents was discontinued long enough according to the ICU protocols, ensuring that a reliable clinical assessment would be possible.

The need for consciousness monitoring was set as an eligibility criterion since the exclusion of patients with initially normal consciousness would have resulted in selection bias. Furthermore, only the worst recorded values were included. However, this could also be subject to bias, since a variety of complications can also interfere with the patients’ responses. Thus, the FOUR Score and GCS values after clinical deterioration were included in the calculations only if they were directly linked to the main pathology.

Statistical analysis

Descriptive statistics are presented as means ± standard deviation or medians. Normality of data was checked with the Kolmogorov–Smirnov test. p values < 0.050 were considered statistically significant.

The outcomes of interest were mortality at discharge (in-hospital mortality) and at 6 months (long-term mortality), poor outcome at discharge (short-term outcome) and at 6 months (long-term outcome). Poor outcome was defined as mRS values of 3–6 and GOSE values of 1–4 [1, 23, 24].
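The dichotomization described above can be written down explicitly. The following is a minimal sketch and not the authors' code: the function names are hypothetical, and the valid ranges (0 to 6 for the mRS, 1 to 8 for the GOSE) are those of the standard scales.

```python
def poor_outcome_mrs(mrs: int) -> bool:
    """Return True when an mRS grade counts as poor outcome (mRS 3-6)."""
    if not 0 <= mrs <= 6:
        raise ValueError(f"mRS must be between 0 and 6, got {mrs}")
    return mrs >= 3


def poor_outcome_gose(gose: int) -> bool:
    """Return True when a GOSE grade counts as poor outcome (GOSE 1-4)."""
    if not 1 <= gose <= 8:
        raise ValueError(f"GOSE must be between 1 and 8, got {gose}")
    return gose <= 4
```

Note that the two scales run in opposite directions: higher mRS grades and lower GOSE grades both indicate worse outcome, which is why the two thresholds point in different directions.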

The association with the outcomes of interest was also assessed for the FOUR Score and GCS (total and sub-scores) on admission and at discharge, and for the difference between the admission and discharge values of the total score of each scale (ΔFOUR, ΔGCS).

Areas under the receiver operating characteristic curve (AUCs) for all raters and in comparison with the main researcher were estimated, with sensitivity, specificity and cut-off values of maximum Youden index. The method proposed by DeLong [25] was followed for AUC comparisons [24, 26, 27, 28, 29], with Bonferroni correction for multiple comparisons [30]. AUCs with sensitivity, specificity and cut-off values of maximum Youden index were also estimated regarding the ability of the total FOUR Score and GCS values to identify comatose patients.
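To illustrate how an AUC and a Youden-index cut-off are obtained for a scale on which lower scores indicate worse status, a minimal sketch is given below. It is not the MedCalc procedure used in the study: the function names and the toy data are invented for illustration, the AUC is computed in its Mann-Whitney form, and the decision rule is assumed to be "score ≤ cut-off predicts the event", matching the way cut-offs are reported in this paper.

```python
def split_by_event(scores, events):
    """Separate scores into event and non-event groups."""
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    return pos, neg


def auc_low_score(scores, events):
    """AUC as the probability that an event case scores lower than a
    non-event case, counting ties as one half (Mann-Whitney form)."""
    pos, neg = split_by_event(scores, events)
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(scores, events):
    """Return (cutoff, sensitivity, specificity) maximizing the Youden
    index J = sensitivity + specificity - 1 for the rule score <= cutoff.
    When several cut-offs tie, the lowest one is kept."""
    pos, neg = split_by_event(scores, events)
    best = None
    for c in sorted(set(scores)):
        sens = sum(p <= c for p in pos) / len(pos)
        spec = sum(n > c for n in neg) / len(neg)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1], best[2], best[3]


# Invented toy data (not study data): 1 = event, 0 = no event.
toy_scores = [3, 5, 6, 9, 12, 14, 15, 16]
toy_events = [1, 1, 1, 0, 1, 0, 0, 0]
print(auc_low_score(toy_scores, toy_events))   # 0.9375
print(youden_cutoff(toy_scores, toy_events))   # (6, 0.75, 1.0)
```

In the toy data, 15 of the 16 event/non-event pairs are ordered correctly, giving an AUC of 0.9375, and a cut-off of ≤ 6 captures three of four events without any false positives.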

A power analysis was performed to determine the minimum number of participants needed for adequate statistical power. AUC values of approximately 0.900 were expected, and the level of accepted difference between them was set at 5%. It was found that 84 subjects would provide a power of 80% at a 5% level of significance, which were considered appropriate for the purposes of the current study. Moreover, previous similar single-center prospective studies assessing the predictive value of the two scales of interest presented with a similar median sample of 90 patients [11].
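The exact inputs to the power calculation are not reported, so the figure of 84 cannot be reproduced here. As a rough illustration of how the precision of an AUC near 0.900 depends on sample size, the sketch below uses the Hanley and McNeil (1982) approximation for the standard error of a single AUC. This is a swapped-in, simpler technique and not necessarily the one applied by the authors; the function name and the equal 42/42 split of events and non-events are assumptions made only for the example.

```python
import math


def hanley_mcneil_se(auc: float, n_events: int, n_non_events: int) -> float:
    """Approximate standard error of an AUC (Hanley and McNeil, 1982)."""
    if not 0.0 < auc < 1.0:
        raise ValueError("auc must lie strictly between 0 and 1")
    if n_events < 1 or n_non_events < 1:
        raise ValueError("both groups need at least one subject")
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    numerator = (
        auc * (1.0 - auc)
        + (n_events - 1) * (q1 - auc ** 2)
        + (n_non_events - 1) * (q2 - auc ** 2)
    )
    return math.sqrt(numerator / (n_events * n_non_events))


# Hypothetical split of 84 subjects into 42 events and 42 non-events.
se_84 = hanley_mcneil_se(0.90, 42, 42)
print(round(se_84, 3))  # 0.035
```

Under these assumptions the standard error of an AUC of 0.900 is about 0.035. A comparison of two correlated AUCs additionally depends on the correlation between the two scales, which this sketch does not model.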

The software packages MedCalc© version 20 (MedCalc Software Ltd, Ostend, Belgium) and G*Power© 3.1 (Heinrich-Heine-Universität Düsseldorf, Germany) were used for the analysis.
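For readers who wish to see the logic of the paired AUC comparison outside a statistical package, a minimal pure-Python sketch of the DeLong test with a Bonferroni adjustment follows. It is an illustration under stated assumptions, not the MedCalc implementation: all function names and the toy data are invented, lower scores are assumed to indicate worse status, and both scales are assumed to be measured on the same patients.

```python
import math


def _psi(event_score, non_event_score):
    """Kernel: 1 if the event case scores lower, 0.5 on a tie, else 0."""
    if event_score < non_event_score:
        return 1.0
    if event_score == non_event_score:
        return 0.5
    return 0.0


def _components(scores, events):
    """Return the AUC and the DeLong structural components V10 and V01."""
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    if len(pos) < 2 or len(neg) < 2:
        raise ValueError("need at least two events and two non-events")
    v10 = [sum(_psi(p, n) for n in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, n) for p in pos) / len(pos) for n in neg]
    return sum(v10) / len(v10), v10, v01


def _cov(a, b):
    """Sample covariance (n - 1 denominator) of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_paired(scores_a, scores_b, events):
    """Two-sided DeLong test for two AUCs measured on the same patients.

    Returns (auc_a, auc_b, z, p_value).
    """
    auc_a, v10_a, v01_a = _components(scores_a, events)
    auc_b, v10_b, v01_b = _components(scores_b, events)
    m, n = len(v10_a), len(v01_a)
    var_diff = (
        (_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / m
        + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / n
    )
    if var_diff <= 0:
        raise ValueError("degenerate variance; the test is undefined")
    z = (auc_a - auc_b) / math.sqrt(var_diff)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return auc_a, auc_b, z, p_value


def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p_value * n_comparisons)


# Invented toy data (not study data): 1 = event, 0 = no event.
toy_events = [1, 1, 1, 0, 0, 0]
toy_four = [4, 6, 10, 9, 14, 16]
toy_gcs = [3, 8, 7, 6, 13, 15]
auc_four, auc_gcs, z, p = delong_paired(toy_four, toy_gcs, toy_events)
print(round(auc_four, 3), round(auc_gcs, 3), round(p, 3))  # 0.889 0.778 0.48
```

With only six toy patients the difference between the two AUCs is far from significant, and multiplying the p value by the number of comparisons (capped at 1) gives the Bonferroni-adjusted value.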

Results

Participants’ characteristics and assessments’ data

Of the 489 subjects that were found eligible, 403 were excluded according to the aforementioned criteria, and 86 were finally enrolled. The demographic data are shown in Table 1. None was excluded due to missing information, including outcome assessment at 6 months. Forty patients died during hospitalization; thus, only 46 patients were assessed at discharge.

Table 1 Demographic data, baseline characteristics and results from the assessment of consciousness of participants using the FOUR and the GCS

Short-term outcome prediction

Table 2 presents the relationship between the two scales’ scores and in-hospital mortality and poor outcome at discharge. AUC values for the total scores of GCS and FOUR exceeded 0.880, suggesting a very good predictive value for both scales (Fig. 1). No significant difference between GCS and FOUR Score was noted (p = 0.100 for in-hospital mortality, p = 0.720 for mRS 3–6 and p = 0.830 for GOSE 1–4). The cut-off values for the FOUR Score were: ≤ 13 for in-hospital mortality (sensitivity 92.50%, specificity 82.61%), ≤ 15 for mRS 3–6 (85.29%, 83.33%) and ≤ 13 for GOSE 1–4 (68.18%, 100%). The respective values for the GCS were ≤ 9 (85%, 84.78%), ≤ 12 (82.35%, 88.89%) and ≤ 10 (66.67%, 100%).

Table 2 AUC values for the total (in bold) and partial scores of FOUR Score and GCS for short- and long-term outcomes
Fig. 1 AUCs comparing the FOUR Score with the GCS in predicting short- and long-term mortality and poor outcome. Both scales showed great predictive ability (AUC > 0.880), without significant difference between them (p > 0.050). AUC, area under the receiver operating characteristic curve; FOUR, Full Outline of Unresponsiveness Score; GCS, Glasgow Coma Scale; GOSE, Glasgow Outcome Scale—Extended; mRS, modified Rankin Scale

Long-term outcome prediction

The results from AUC analysis for long-term outcome are presented in Table 2. The total score for both scales was again a consistent predictor of outcome. AUC values exceeded 0.910 for both mortality and poor outcome at 6 months (Fig. 1). No significant difference between the predictive value of the two scales was noted (p = 0.490 for mortality, p = 0.070 for mRS 3–6 and p = 0.050 for GOSE 1–4). The cut-off values for the FOUR Score were ≤ 14 for mortality (sensitivity 92.16%, specificity 74.29%), ≤ 13 for mRS 3–6 (77.19%, 96.55%) and ≤ 13 for GOSE 1–4 (77.59%, 100%). The respective values for the GCS were ≤ 12 for mortality (96.08%, 74.29%), ≤ 12 for mRS 3–6 (94.74%, 86.21%) and ≤ 12 for GOSE 1–4 (94.83%, 89.29%).

The aforementioned calculations were conducted for the recorded FOUR and GCS scores at discharge, and for the difference in the values between admission and discharge (Online Resource 1). Assessments at discharge seem to be reliable outcome predictors as well, but ΔFOUR and ΔGCS failed to show any significant predictive value.

Predictive value among raters

AUC values were calculated based on the ratings of the two less experienced raters (a resident and a nurse) and the second consultant, which were then compared with the reference rater (Table 3). AUC values were high (> 0.860). When compared with the reference rater’s assessments, no significant difference was seen.

Table 3 AUC values for raters with different experience for predicting all outcomes of interest, and the results from the comparison with the reference (in bold) rater

Coma identification

Both GCS and FOUR Score scored high in identifying coma (AUC values exceeding 0.950 with p < 0.001 for all raters) with no significant difference compared with the main rater (Table 3; Fig. 2). The comparison between the two scales also did not show any significant superiority (p = 0.675). The cut-off values for the FOUR Score were ≤ 11 for the reference rater (sensitivity 100%, specificity 91.38%) and residents (96.43%, 87.93%), and ≤ 10 for the consultant (92.86%, 93.10%) and nurses (96.43%, 89.66%). The respective values for the GCS were ≤ 8 for the reference rater (100%, 86.21%) and the consultant (96.43%, 86.21%), and ≤ 7 for residents (96.43%, 86.21%) and nurses (100%, 87.93%).

Fig. 2 AUCs for the FOUR Score and the GCS in identifying coma, when applied by various examiners. Both scales scored high (AUC > 0.950). AUC, area under the receiver operating characteristic curve; FOUR, Full Outline of Unresponsiveness Score; GCS, Glasgow Coma Scale

Discussion

In the present study, the predictive value of two coma scales, the FOUR Score and the GCS, was assessed. Both scales showed very good to excellent ability to predict short- and long-term mortality and poor outcome with no significant difference between them. The predictive value was not influenced by the rater’s background and experience, and both scales successfully identified comatose patients. The difference between admission and discharge values failed to show any predictive value, but values at discharge seemed to be reliable predictors of long-term outcome.

With few exceptions [21, 31], most previous reports investigating the predictive value of the FOUR Score had a maximum follow-up period of 3 months [11, 14]. We evaluated patients at 6 months, thereby also capturing risk factors that may affect mortality and outcome without being directly linked to the main pathology. This is particularly relevant in neurosurgical patients, who frequently have long periods of hospitalization and recovery and are prone to various complications even after discharge.

Although the GCS appeared superior in predicting long-term outcome, the difference was not statistically significant, and excellent predictive ability was seen for both scales at 6 months. Similarly, although the FOUR Score overall performed better in the prediction of in-hospital mortality, a clear superiority over the GCS was not seen. Therefore, it is reasonable to conclude that the FOUR Score is not inferior to the long-used GCS, and that both scales can be included in predictive models for neurosurgical pathologies. The current findings on the predictive value of the scales are overall in agreement with the existing literature. Indeed, previous studies have repeatedly demonstrated the ability of the FOUR Score and the GCS to predict short- and long-term outcome [11, 14]. However, the previously reported advantage of the FOUR Score in predicting in-hospital mortality [11] was not confirmed by the present results.

In the current study, a methodologically consistent comparison between raters with different experience and education was performed, which has not previously been thoroughly investigated [32, 33]. The predictive value of the FOUR Score remained unaffected by the background of the examiner, an important finding since the scale was only recently translated into Greek and there was no previous experience with its use [13]. The predictive ability of the GCS also remained high regardless of the raters’ background. Thus, it can be suggested that both scales can be reliably used even by less experienced health-care professionals.

There is a limited number of studies assessing the predictive value of the FOUR Score beyond the day of admission, whereas none has used data from discharge [26, 31, 34, 35]. In one study, a superiority of later assessments over those on admission was seen [34]. In the current report, the GCS and FOUR Score values at discharge were tested for their ability to predict long-term outcome. Both showed some predictive value, but, due to insufficient sample size, it could not be tested whether they offer any advantage over admission values. Further, the difference between the two values (on admission and at discharge) failed to show any actual use in predicting patients’ outcome (AUCs significantly < 0.650).

The clinical significance of coma scales should not be restricted to their predictive value. In the current study, both scales performed excellently in recognizing coma, with AUC values exceeding 0.950 for all raters. The GCS threshold of 8, widely used in clinical practice [9], was largely confirmed by our results, since the cut-off scores were 8 for the consultants and 7 for the less experienced raters. Using the same definition, the corresponding value of the FOUR Score should be close to 11. Indeed, the cut-off values for the FOUR Score were 11 for the main researcher and the residents, and 10 for the second consultant and the nurses, thus confirming the expected results.

The study has some limitations. It is a single-center analysis, limited to patients of neurosurgical concern. More than 80% of eligible patients were eventually excluded from the study. Even though the appropriate study size was reached, it would have been preferable for all eligible patients to be included. However, since the reasons for exclusion were mostly the unavailability of four trained raters and factors jeopardizing the reliability of the ratings (e.g., patients under sedation and with severe dementia or mental illness), we made every possible effort to maintain the study’s rigorous and consistent methodology. Few patients under mechanical ventilation were included, and those under sedation were excluded from the study. With the exception of the two consultants, it was not feasible to obtain four constant raters, which might have affected the consistency of the ratings. However, all raters underwent similar training for the application of the scales, which has already proved to be sufficient in minimizing inter-rater disagreement and the risk of bias [13]. The use of the worst level of responsiveness as a predictor is a common practice for similar studies in the literature [11], but this might be affected by factors unrelated to the main pathology. Furthermore, slight clinical deterioration that remained unnoticed may have occurred. Outcome assessments and the classification of patients as comatose or non-comatose were performed by the main researcher before his own consciousness assessments and blinded to any other ratings, and he was in no case the treating physician. This approach was chosen to avoid bias and increase consistency. Further, all long-term outcome assessments were performed by phone interviews using standardized questions. The sample needed for reliable AUC comparisons was not reached for the assessments at discharge; thus, potential advantages over the admission assessments could not be revealed. Finally, considering the 6-month follow-up period, the possibility that factors other than brain damage might have resulted in increased mortality and worse outcome cannot be ruled out.

Nevertheless, this is a prospective study implementing a rigorous design, which provides an in-depth analysis of the prognostic value of two important clinical tools for the evaluation of the level of consciousness. The follow-up period was longer than usual, no patient was lost, and the impact of different health-care backgrounds and experience among the raters was thoroughly examined using a consistent methodology. The assessment at discharge and the difference between admission and discharge have been tested for prognostic significance for the first time. Lastly, the role of the scales in diagnosing coma was assessed.

Conclusions

The FOUR Score and the GCS showed excellent predictive value, both for short- and long-term outcome. Their performance remained unaffected by the rater’s background and experience. Both scales performed equally well in identifying coma, while discharge assessments need further investigation for their utility. The current results need further verification for intubated patients.