Clinical implications

What is already known?

  • The Sequential Organ Failure Assessment (SOFA) was designed to focus on organ dysfunction and morbidity, with less of an emphasis on mortality prediction.

  • It remains unclear which of the Full Outline of UnResponsiveness (FOUR) score, the Glasgow Coma Scale (GCS), and the SOFA has the better calibration and discrimination power for predicting outcomes in critically ill children.

What does this article add?

  • In the current study, the SOFA showed a good ability to predict mortality in the pediatric population.

  • Among FOUR score, GCS, and SOFA, the GCS was the best tool in terms of discrimination and calibration power to predict mortality.

Introduction

Prognostic scoring systems have been introduced and developed since the 1980s to describe the severity of illness in an objective, quantitative, and uniform manner. Scoring models provide objective measures for inter- and intraunit comparisons, quality assessment, cost–benefit analyses, evaluation and comparison of outcome and survival, stratification of patients prior to randomization in clinical trials, and clinical decision making [1, 2]. Discrimination and calibration power are two key characteristics that all prognostic scoring systems should meet. Calibration can deteriorate over time, mostly because of changes in case mix and quality of care, which often results in an overestimation of the mortality rate and, subsequently, poor discrimination [3]. Therefore, periodically updating a scoring model to reflect contemporary practice and patient characteristics is critical for maintaining accuracy. Calibration power describes how the scoring model performs over a wide range of predicted mortalities, and discrimination power describes the accuracy of a given estimate (the ability to discriminate between survivors and nonsurvivors).

The SOFA, GCS, and FOUR score are well-known scoring models in intensive care units. Higher SOFA scores and lower GCS and FOUR scores in the first 24 h of intensive care unit (ICU) admission are accompanied by greater mortality rates. Implementing these predictive models in a new population requires external validation to support their generalizability.

The SOFA was originally designed to serially assess the severity of organ dysfunction in patients who were critically ill from sepsis [4]. The GCS is the best-known scoring system for describing the level of consciousness in a patient following a traumatic brain injury (TBI). The FOUR score is an alternative model that has been developed and validated and may have better utility than the GCS in coma diagnosis, mainly because it includes a brainstem examination [5].

Ha et al. [6] performed a retrospective study of 54 consecutive pediatric oncology patients requiring mechanical ventilation for more than 72 h in the pediatric ICU (PICU). They found that serial evaluation of the SOFA score during the first few days after PICU admission was a good predictor of prognosis, and that other independent factors, including the initial SOFA score and the delta-SOFA score during the first 72 h, were closely correlated with mortality.

In a retrospective cohort analysis of 9959 patients with severe TBI, Emami et al. [7] used pupillary status and GCS score to evaluate whether patients ≤15 years old with severe TBI have a lower mortality rate and better outcome than adults with severe TBI. According to their findings, the overall mortality rate and the mortality rate for patients with a GCS of 3 and bilaterally fixed and dilated pupils (19.9% and 16.3%, respectively) were lower for pediatric than for adult patients (80.9% vs 85%, respectively).

In a cohort study of 55 TBI patients, Nyam et al. [8] assessed whether the FOUR score can be used as an alternative to the GCS in predicting ICU mortality in TBI patients. The endpoint of observation was mortality at the time the patients left the ICU. The area under the curve (AUC) of each predictive model was used to compare the predictive power of the four models. The AUC was 74.47% for the FOUR score, 74.73% for the GCS, 81.78% for the Acute Physiology And Chronic Health Evaluation II (APACHE II), and 53.32% for the Therapeutic Intervention Scoring System (TISS). They reported that the FOUR score had predictive power for mortality similar to that of the GCS and APACHE II, and they proposed that the FOUR score can be used as an alternative to the GCS in the prediction of early mortality in TBI patients in the ICU.

The performance of these three scoring models varies, and it is still controversial which predictive model is more suitable for predicting outcomes in critically ill children [8, 9, 10]. Repeated and regular recalibration should be undertaken to provide well-validated predictive models. To the best of our knowledge, no study to date has evaluated the predictive value of these three models together in critically ill children. The aim of the current study was to evaluate the performance of the SOFA, FOUR score, and GCS in medical and surgical ICUs.

Methods

In a prospective observational cohort study, 90 consecutive pediatric patients admitted to the medical ICU (MICU) and surgical ICU (SICU) of two university hospitals were enrolled from July 2014 to October 2015. The inclusion criteria were age ≤18 years and an ICU length of stay (LOS) of less than 24 h. Patients who were brain dead at the time of admission were excluded.

To assess the ability of the FOUR, GCS, and SOFA scores (three independent variables) to predict mortality, we used logistic regression. With a predetermined effect size of 0.50, a significance level (alpha) of 0.05, and a statistical power of 0.80, the required sample size was calculated to be 76 [11]. We increased the sample size to 90 subjects.

Demographic data, including age and gender, were recorded for each patient, and the SOFA, GCS, and FOUR scores were calculated and recorded in the first 24 h of MICU/SICU admission. The SOFA score is based on six different scores, one each for the respiratory, cardiovascular, hepatic, coagulation, renal, and neurological systems. According to the study by Ferreira et al. [12], regardless of the initial score, the mortality rate is at least 50% when the score increases, 27% to 35% when the score remains unchanged in the first 96 h of admission, and less than 27% when the score decreases. The FOUR score (range 0–16) assesses four domains of neurological function: eye responses, motor responses, brainstem reflexes, and breathing pattern. Some studies have shown that the FOUR score has better biostatistical properties than the GCS in terms of sensitivity, specificity, accuracy, and positive predictive value [13]. The GCS (range 3–15) was published in 1974, and serial assessment of patients with traumatic brain injury was the initial indication for use of the scale. Scoring the GCS in patients with an endotracheal tube (ETT) or tracheostomy is challenging. As described by Edwards [14], we recorded the verbal response of intubated patients as ETT or T, so the maximum possible score was 10 + ETT or 10 + T for intubated patients and 15 for nonintubated patients.

All data were first recorded on a standardized data collection form for the SOFA, GCS, and FOUR score and then transferred to SPSS statistical software (IBM Corp., Released 2013, IBM SPSS Statistics for Windows, Version 22.0, Armonk, NY, USA). After the SOFA, GCS, and FOUR scores had been calculated for each patient, the relationship between the scores and patients’ outcomes was analyzed. The primary outcome of the study was survival or nonsurvival in the medical/surgical ICUs. Pediatric patients who were transferred from the MICU or SICU to another ward of the hospitals were classified as survivors, and patients who died or were classified as brain dead were classified as nonsurvivors. Patients’ privacy was maintained by not publishing identifying information. There was no intervention in the current study.

After the data were encoded, the characteristics of the study population were summarized with simple descriptive statistics: means with standard deviations for continuous variables, and frequencies with percentages for categorical data. The association between the three predictive models and patients’ outcomes was assessed using logistic regression. The SOFA, GCS, and FOUR scores were treated as independent continuous variables, and a p-value < 0.05 was considered significant.

To evaluate the predictive value of these scoring models, standard tests of discrimination and calibration power were used. The discrimination power of a scoring model reflects its ability to distinguish between nonsurvivors and survivors and can be obtained by calculating the area under the receiver operating characteristic (ROC) curve (AUC). An AUC of 0.5 (a diagonal line) corresponds to random chance, an AUC greater than 0.7 shows a moderate prognostic value, and an AUC greater than 0.8 (a bulbous curve) indicates a good prognostic value for the model [15].

The calibration of a model describes whether its risk estimates are in accordance with the observed outcomes at different classes of risk; in other words, it represents the agreement between individual probabilities and actual outcomes. In the current study, the calibration power of each model was assessed using the Hosmer–Lemeshow goodness-of-fit (GOF) test. A p > 0.05 indicates a well-calibrated model.
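As an illustration of the calibration test described above, the following is a minimal Python sketch of a Hosmer–Lemeshow statistic; it is not the SPSS procedure used in the study. The function names (`chi2_sf`, `hosmer_lemeshow`), the default of 10 risk groups, and the use of groups − 2 degrees of freedom are assumptions made for the example. The last line checks the reported GCS result under the further assumption of 4 degrees of freedom, which the text does not state.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite sum.
    term, total = 1.0, 0.0
    for i in range(1, (df - 1) // 2 + 1):
        if i > 1:
            term *= x / (2 * i - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total


def hosmer_lemeshow(probs, outcomes, groups=10):
    """Return (statistic, df, p) comparing predicted risks with 0/1 outcomes.

    Assumes groups >= 3 so that df = groups - 2 is at least 1.
    """
    pairs = sorted(zip(probs, outcomes))  # order patients by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        expected = sum(p for p, _ in chunk)   # expected deaths in this risk group
        observed = sum(y for _, y in chunk)   # observed deaths in this risk group
        mean_p = expected / len(chunk)
        # Skip degenerate groups whose mean predicted risk is exactly 0 or 1.
        if 0.0 < mean_p < 1.0:
            stat += (observed - expected) ** 2 / (len(chunk) * mean_p * (1.0 - mean_p))
    df = groups - 2
    return stat, df, chi2_sf(stat, df)


# Reported GCS statistic, with df = 4 assumed for illustration.
print(round(chi2_sf(2.76, 4), 2))  # 0.6
```

Under that assumption the statistic of 2.76 corresponds to p ≈ 0.60, so the p > 0.05 criterion for a well-calibrated model would be met.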

Results

A total of 90 pediatric patients admitted to the M/SICU were enrolled in the study. The mean age of the cohort was 7.80 ± 4.43 years (range 2–18 years), with a male/female ratio of 2:1. The overall ICU mortality rate was 17.8% (n = 16). The basic characteristics of the study population are shown in Table 1.

Table 1 Population characteristics

For the entire cohort of pediatric patients, the SOFA, GCS, and FOUR scores in the first 24 h of ICU admission differed significantly between survivors and nonsurvivors. Nonsurvivors had significantly lower GCS and FOUR scores and higher SOFA scores in the first 24 h of the ICU stay than survivors (GCS: t = 2.97, p = 0.004; FOUR: t = 3.45, p < 0.001; SOFA: t = −3.48, p = 0.002).

The overall performance of a predictive model can be quantified with respect to calibration and discrimination. Calibration power, or reliability, refers to the agreement between observed outcome frequencies and predictions. Discrimination power refers to the model’s ability to distinguish patients with different outcomes (distinguishing value 0 from value 1 of the dependent variable). The overall performance of the three models is compared in Table 2.

Table 2 Comparison of SOFA, GCS, and FOUR score between survivors and nonsurvivors

Analysis of the areas under the ROC curves showed that the discrimination power of the SOFA, GCS, and FOUR score was moderate in the first 24 h of ICU admission (AUC = 0.751, 0.729, and 0.787, respectively). To determine the best cut-off score for each of the three models, the maximum Youden index (sensitivity + specificity − 1) was used. Using a cut-off score of 4.5, the SOFA predicted mortality with a sensitivity of 94%, a specificity of 66%, and an accuracy of 71%; for the GCS, a cut-off score of 7.5 showed a sensitivity of 69%, a specificity of 69%, and an accuracy of 69%; and for the FOUR score, a cut-off score of 5.5 showed a sensitivity of 68%, a specificity of 88%, and an accuracy of 69% (Table 2).
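The AUC and Youden index calculations can be sketched in a few lines of Python. This is an illustrative sketch only: the function names (`auc`, `best_cutoff`), the `higher_is_worse` flag, and the ten toy scores are hypothetical and are not the study data. The flag reflects that higher SOFA scores indicate greater risk, whereas for the GCS and FOUR score lower values do.

```python
def auc(scores, died, higher_is_worse=True):
    """Probability that a random nonsurvivor is ranked as higher risk than a
    random survivor (ties count as one half); equals the area under the ROC curve."""
    sign = 1 if higher_is_worse else -1
    pos = [sign * s for s, d in zip(scores, died) if d]
    neg = [sign * s for s, d in zip(scores, died) if not d]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def best_cutoff(scores, died, higher_is_worse=True):
    """Return (cutoff, sensitivity, specificity, J) maximizing
    the Youden index J = sensitivity + specificity - 1."""
    values = sorted(set(scores))
    # Candidate cut-offs are midpoints between adjacent observed scores,
    # which is why integer scores yield cut-offs such as 4.5.
    cutoffs = [(a + b) / 2 for a, b in zip(values, values[1:])]
    n_pos = sum(died)
    n_neg = len(died) - n_pos
    best = None
    for c in cutoffs:
        flagged = [(s > c) if higher_is_worse else (s < c) for s in scores]
        tp = sum(1 for f, d in zip(flagged, died) if f and d)
        tn = sum(1 for f, d in zip(flagged, died) if not f and not d)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[3]:
            best = (c, sens, spec, j)
    return best


# Hypothetical toy data: SOFA-like scores and outcomes (1 = died, 0 = survived).
toy_scores = [2, 3, 3, 4, 4, 5, 6, 7, 8, 9]
toy_died = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
print(auc(toy_scores, toy_died))
print(best_cutoff(toy_scores, toy_died))
```

For the GCS or FOUR score, the same functions would be called with `higher_is_worse=False`, so that patients scoring below the cut-off are flagged as high risk.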

The Hosmer–Lemeshow χ2 statistic showed a good fit for the GCS (χ2 = 2.76, p = 0.60), whereas the other two models were not well calibrated. ROC curves were drawn to assess the predictive accuracy of the three models (Fig. 1). The predictive accuracy of all three models was similar.

Fig. 1 The receiver operating characteristic (ROC) curves for the Sequential Organ Failure Assessment (SOFA), Glasgow Coma Scale (GCS), and Full Outline of UnResponsiveness (FOUR) score in the first 24 h after admission to medical and surgical intensive care units. The areas under the curve were 0.751, 0.729, and 0.787, respectively

In the logistic regression model, each one-point increase in the SOFA score was associated with a 1.63-fold increase in the odds of ICU mortality (odds ratio [OR] 1.627, 95% confidence interval [CI] 1.192–2.219; p = 0.002), whereas each one-point increase in the GCS and FOUR scores was associated with a 31.1% and 38.3% decrease, respectively, in the odds of mortality (OR 0.689, 95% CI 0.525–0.904; p = 0.007; OR 0.617, 95% CI 0.449–0.847; p = 0.003, respectively). These associations of all three scores with mortality remained after adjusting for sex and age; thus, all three models were significant predictors of pediatric patients’ outcomes.
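The arithmetic linking an odds ratio to a percent change in odds, and a confidence interval to an approximate p-value, can be sketched as follows. This is an illustrative check, not the SPSS output: the function names are hypothetical, and the p-value reconstruction assumes a Wald-type 95% CI that is symmetric on the log-odds scale. The odds ratios and CIs are those reported above.

```python
import math


def percent_change_in_odds(odds_ratio, points=1):
    """Percent change in the odds of death for a `points`-unit increase in a score."""
    return (odds_ratio ** points - 1.0) * 100.0


def wald_p_from_ci(odds_ratio, lower, upper, z_crit=1.959964):
    """Approximate two-sided p-value recovered from an OR and its 95% CI,
    assuming the CI is exp(log(OR) +/- z_crit * SE)."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


# Odds ratios and 95% CIs as reported in the text.
reported = {
    "SOFA": (1.627, 1.192, 2.219),
    "GCS": (0.689, 0.525, 0.904),
    "FOUR": (0.617, 0.449, 0.847),
}
for name, (or_, lo, hi) in reported.items():
    change = percent_change_in_odds(or_)
    p = wald_p_from_ci(or_, lo, hi)
    print(f"{name}: {change:+.1f}% odds per point, p ~ {p:.3f}")
```

Under this assumption, the recovered values agree with the reported figures: an increase of about 62.7% per SOFA point, decreases of 31.1% and 38.3% per GCS and FOUR point, and p-values of approximately 0.002, 0.007, and 0.003.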

Discussion

Our results showed that the mean GCS and FOUR scores were significantly higher and the mean SOFA score was significantly lower in survivors than in nonsurvivors (GCS: t = 2.97, p = 0.004; FOUR: t = 3.45, p = 0.001; SOFA: t = −3.48, p = 0.001). In the analysis of the AUCs, all three models had moderate discrimination power (GCS: AUC = 0.729, p = 0.004; FOUR: AUC = 0.787, p < 0.001; SOFA: AUC = 0.751, p = 0.002). The Hosmer–Lemeshow χ2 test showed good calibration for the GCS (χ2 = 2.76, p = 0.60), but the other two models were not well calibrated. Therefore, the GCS showed good performance in predicting outcomes in the pediatric population.

Based on the Youden index, the best cut-off scores for the SOFA, GCS, and FOUR score were 4.5, 7.5, and 5.5, respectively. The best cut-off score for the SOFA in the study by Castelli et al. [16] was also 4.5, and the AUC (0.731), sensitivity (73%), specificity (68%), negative predictive value (NPV; 86%), and positive predictive value (PPV; 47%) were similar to our findings. The best cut-off score for the GCS in the study by Mahdian et al. [17] was 4.5, which is lower than our finding; they assessed the predictive value of the GCS among 60 brain-injured adult patients. The AUC (0.947), sensitivity (95.3%), specificity (82.4%), NPV (87.5%), and PPV (93.2%) also showed that the GCS performed better in that different setting. In the study by Okasha et al. [18], the best cut-off score for the FOUR score was 11 (with a sensitivity of 80% and a specificity of 100%), which differs from our findings; the different settings of the two studies, including the sample size, age, and severity of disease, may explain these differences. In addition, several studies agree with our findings, reporting that higher GCS and FOUR scores and a lower SOFA score were significantly associated with a lower mortality rate or better prognosis [6, 8, 17].

In contrast to our findings, several studies have reported good performance of these three scoring models in terms of discrimination and calibration. In a prospective study, Gogia et al. [19] compared the predictive value of the SOFA and PELOD (PEdiatric Logistic Organ Dysfunction) scoring systems in a PICU. The mean initial SOFA score was 10.48 ± 2.50 in nonsurvivors vs. 8.41 ± 3.39 in survivors (p = 0.001).

Wang et al. [20] retrospectively analyzed the predictive value of four different scoring systems for the prognosis of patients with sepsis. A total of 311 patients were enrolled in their study (221 survivors, 90 nonsurvivors, 28-day mortality rate 28.9%). Univariate analysis showed that age, length of ICU stay, GCS, and SOFA score within 24 h after diagnosis were significantly different between the two groups (all p < 0.05). Moreover, multiple logistic regression showed that age (odds ratio [OR] 1.388, 95% CI 1.074–1.794, p = 0.012), GCS score (OR 0.541, 95% CI 0.303–0.967, p = 0.038), and SOFA score (OR 3.189, 95% CI 1.813–5.610, p < 0.001) were independent risk factors for sepsis outcome. Analysis of the ROC curves showed that the SOFA score was the most powerful model for predicting the outcome in critically ill patients with sepsis; for the SOFA, the AUC was 0.700, and at a cut-off score of 7.5 points the sensitivity was 73.3% and the specificity was 58.8%.

Jamal et al. [21] performed an observational study to compare the predictive ability of the FOUR score and GCS in children with impaired consciousness. A total of 63 children (5–12 years) with impaired consciousness of <7 days’ duration were included in their study. The AUC for in-hospital mortality was 0.83 (range 0.7–0.9) for the GCS and 0.8 (range 0.7–0.9) for the FOUR score; in addition, the AUC for poor functional outcome was 0.82 (range 0.72–0.93) for the GCS and 0.79 (range 0.68–0.9) for the FOUR score. They concluded that the FOUR score was as good as the GCS in predicting in-hospital and 3‑month mortality and functional outcome at 3 months.

In this study, calibration power was good only for the GCS; consistent with this finding, several studies have reported that the calibration of predictive models varies across contexts [1, 22]. Calibration can be separated into two measurable components, bias and spread, and a third component, unexplained error. Calibration bias suggests that the prevalence of the outcome differs between the data used to develop the model and the evaluation data; it can arise from differences in methodology between the studies that collected these datasets, or from differences between the regions or settings covered by these studies. It is important to develop models using data that are representative of the settings in which a model is to be applied. Therefore, periodic recalibration of models should help maintain the predictive ability of scoring systems in future studies.

In the current study, the overall mortality rate was 17.8%, compared with 6.8% in the study by Jentzer et al. [23], 16.6% in the study by Lee et al. [24], and 44.5% in the study by Khwannimit et al. [25]. Differences in severity of illness and quality of care between ICUs may explain these discrepancies.

With improved training and clinical practice in applying these predictive tools, standardized assessment across different settings, and customized and individualized models, more valid and reliable models can be expected in the future. Several limitations of the current study should be addressed in further research. First, sample size is known to have a substantial influence on model calibration. Second, differences in setting (case mix) and quality of care among ICUs can lead to calibration bias. Ethical considerations were observed in this study.

Conclusion

The SOFA, GCS, and FOUR score had moderate discriminating ability. In terms of calibration, GCS was the only well-calibrated model; thus, it was superior in predicting mortality rate in pediatric patients admitted to M/SICU. Further large multicenter studies are needed to determine the best scoring systems.