Introduction

The ability to evaluate the severity of the disease state in critically ill patients is important as a measure of intensive care unit (ICU) performance, as a means of providing patients’ relatives with information regarding outcome, as a guide for resource allocation, and as a way of stratifying patients in clinical research [1]. Currently available prediction models such as the Acute Physiology and Chronic Health Evaluation (APACHE) II [2], Simplified Acute Physiology Score (SAPS) [3], and Mortality Probability Models (MPM) [4] use values taken within the first 24 h of an ICU stay. However, these scores ignore the many factors that can influence patient outcome during the course of an ICU stay beyond the first 24 h. Some researchers [5, 6] have advocated the sequential application of these systems, possibly with correction for other factors, but such use is as yet experimental, except for MPM [7]. Sicignano et al. [8] observed that the discriminative power of SAPS decreases over time, retaining its predictive power only for patients who stay in the ICU no more than 5 days. These scoring systems have been evaluated in several large Canadian and European studies, which have confirmed their predictive accuracy in those settings. The area under the receiver operating characteristic (ROC) curve for the three models ranged from 0.74 to 0.86 in these studies [9, 10, 11]. Since these scoring systems were developed using large population databases, their predictions may not always fit well in isolated ICUs because of smaller numbers of patients [12] and differences in case-mix [13, 14]. Comparing logistic regression models and artificial neural networks, Clermont et al. [12] observed that as the size of the development set decreases, model performance deteriorates rapidly.
Increasing the number of patients with a particular condition causes the discrimination and calibration of the MPM II to deteriorate [13] and the mortality ratios (observed deaths divided by predicted deaths) obtained with the APACHE II to vary widely [14]. The application of such models to the assessment of individual ICU performance is thus difficult. Patel and Grant [15] observed that although the mortality predicted by APACHE II, MPM II, and SAPS II is similar to the 95% confidence interval (CI) of the observed mortality, the models fit poorly, impairing the validity of the result. This finding may be the result of differences in quality of care, differences in case-mix, small numbers of patients, or delay in ICU admission.
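The mortality ratio mentioned above can be illustrated with a minimal sketch. It assumes that predicted deaths are obtained by summing each patient's model-predicted probability of death; the function name and the four-patient example are hypothetical and not taken from the cited studies.

```python
def mortality_ratio(observed_deaths, predicted_probs):
    """Observed deaths divided by predicted deaths.

    Predicted deaths are taken as the sum of the individual predicted
    probabilities of death (an assumption of this sketch).
    """
    predicted_deaths = sum(predicted_probs)
    if predicted_deaths <= 0:
        raise ValueError("predicted deaths must be positive")
    return observed_deaths / predicted_deaths


# Hypothetical unit: four patients, about 1.0 predicted deaths, 2 observed deaths.
example_probs = [0.10, 0.20, 0.30, 0.40]
example_ratio = mortality_ratio(2, example_probs)
```

A ratio above 1 indicates more deaths than the model predicts, and a ratio below 1 indicates fewer.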

The Sequential Organ Failure Assessment (SOFA) assesses patients for organ dysfunction not only at ICU admission but serially during the ICU stay and was first developed to evaluate morbidity [16]. Organ dysfunction is related to both morbidity and mortality [17, 18], and it has been shown that the SOFA score, using either the sum of the maximum scores for each system (SOFA Max), the admission value, or the changes in the first 48 h, is related to mortality [19, 20]. The aim of this study was to develop an ICU mortality prediction model based on organ dysfunction measurements and, using this model, to compare differences in ICU mortality in different countries.

Materials and Methods

A database collected by a working group on sepsis-related problems of the European Society of Intensive Care Medicine was analyzed. The database consisted of data collected in 40 ICUs from 16 countries, with ICUs located in Australia (1), Europe (35), North America (1), and South America (3). Data were collected on all 1,449 patients admitted to the participating ICUs during a 1-month period, excluding those with a length of stay (LOS) less than 2 days following uncomplicated surgery. Coronary patients were included. Further details of the database formation are provided elsewhere [21]. For each patient, basic demographic data were collected at admission, and the variables needed to construct the SOFA score were collected at admission and every 24 h thereafter. A single missing value was replaced by the mean of the values preceding and following it. When more than one consecutive value was missing, the values were treated as missing in the analysis. Patients with an LOS of at least 2 days were identified and classified by country. The six countries with the largest numbers of patients (Brazil, Belgium, France, Italy, Spain, United Kingdom) were selected for further analysis. Univariate analysis was performed using nonpaired t tests for continuous variables and χ2 tests for categorical ones to assess which variables were related to mortality. Variables with a p value less than 0.15 were entered into a logistic regression model to identify those significantly associated with mortality. Two logistic regression models were constructed, one based on the SOFA Max, which was the best predictor in previous studies [19, 20], and another (SOFA Max-infection) based on the variables that remained significant (p<0.05) after multivariate analysis. Both models were applied individually to obtain a predicted mortality for the whole population and for each country.
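The rule for missing values described above can be sketched as follows. This is a minimal illustration, not the original analysis code: the function name is hypothetical, missing values are represented by None, and a missing first or last value is left missing because the text does not say how such cases were handled.

```python
def impute_single_gaps(values):
    """Fill isolated missing values (None) with the mean of their neighbours.

    Runs of two or more consecutive missing values are left as None.
    Missing values at either end of the series are also left as None
    (an assumption; the source does not cover this case).
    """
    filled = list(values)
    for i in range(1, len(values) - 1):
        if values[i] is None:
            # Read neighbours from the original series so that an imputed
            # value is never used to fill an adjacent gap.
            before, after = values[i - 1], values[i + 1]
            if before is not None and after is not None:
                filled[i] = (before + after) / 2
    return filled


# Hypothetical daily values: one isolated gap and one run of two gaps.
example = impute_single_gaps([2, None, 4, None, None, 3])
# The isolated gap becomes 3.0; the run of two stays missing.
```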
The variables chosen were the total SOFA score at admission, the Δ value between the SOFA score at 48 h and the admission SOFA, the SOFA Max (defined as the sum of the maximum value for each organ category during the ICU stay), and the Δ value between the admission SOFA and the SOFA Max. Calibration of the model was assessed with Hosmer-Lemeshow’s goodness-of-fit statistics [22] and the Pulkstenis-Robinson D statistic [23]. The Pulkstenis-Robinson D statistic should be used in conjunction with the Hosmer-Lemeshow statistic when both categorical and continuous variables are included in the analysis. Because the data are cross-classified by the categorical covariates, it eliminates the risk of grouping together observations with similar fitted probabilities but different covariate patterns, which may occur with the Hosmer-Lemeshow approach. Assessment of calibration usually relies on p values being higher than 0.05. Another strength of the D statistic is that even when p values are above 0.05, significant flaws in the model can be perceived and interactions can be added to the model to improve the p values, although this must be done with careful clinical reasoning. The interpretation of the Pulkstenis-Robinson D statistic involves two steps: first, the p value is examined; second, if it is less than 0.05, the individual values for each stratum are inspected. Variables that have values deviating from zero (±5) should be suspected of being important covariates. Even if p is not less than 0.05, variables whose individual values deviate from zero may be considered important for the model if clinical reasoning is in agreement. ROC curves were constructed, and discrimination was assessed by the area under the curve (AUC) [24, 25].
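The four SOFA-derived variables defined above can be sketched as follows. This is an illustration under stated assumptions: the six organ systems of the SOFA score are each scored 0 to 4, daily scores are held in a list whose index 0 is admission, and the 48-h score is taken as index 2. The data layout and function name are hypothetical.

```python
ORGANS = ("respiratory", "coagulation", "liver",
          "cardiovascular", "cns", "renal")


def sofa_variables(daily_scores):
    """Derive admission SOFA, delta 48 h, SOFA Max and delta Max.

    daily_scores: list of dicts, one per 24-h period (index 0 = admission),
    each mapping an organ name to its score (0-4).
    """
    totals = [sum(day[organ] for organ in ORGANS) for day in daily_scores]
    admission = totals[0]
    # SOFA Max: sum over organs of the worst score reached during the stay.
    sofa_max = sum(max(day[organ] for day in daily_scores) for organ in ORGANS)
    # Delta 48 h: SOFA at 48 h minus admission SOFA (None if stay is shorter).
    delta_48h = totals[2] - admission if len(totals) > 2 else None
    return {
        "admission": admission,
        "delta_48h": delta_48h,
        "sofa_max": sofa_max,
        "delta_max": sofa_max - admission,
    }


# Hypothetical patient observed at admission, 24 h and 48 h.
patient = [
    dict(respiratory=2, coagulation=0, liver=0, cardiovascular=1, cns=0, renal=1),
    dict(respiratory=3, coagulation=1, liver=0, cardiovascular=3, cns=1, renal=1),
    dict(respiratory=2, coagulation=1, liver=1, cardiovascular=2, cns=0, renal=2),
]
result = sofa_variables(patient)
```

In this example the daily totals are 4, 9, and 8, whereas the SOFA Max is 11, because the worst score for each organ need not occur on the same day.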

The 95% CI for the observed ICU mortality was calculated and compared to the ICU mortality predicted by the models. One-way analysis of variance was used to compare means for continuous variables across different countries, and when differences were found, Fisher’s least significant difference test was applied to compare groups. All tests were two-tailed. When multiple comparisons were made, Bonferroni’s adjustment was applied. A p value less than 0.05 was considered significant. Data are presented as mean±SD, except where otherwise indicated.
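The comparison of observed and predicted mortality can be sketched as below. The text does not state which interval method was used, so the normal-approximation (Wald) interval here is an assumption, and the counts in the example (13 deaths among 29 patients) are hypothetical.

```python
import math


def wald_ci(deaths, n, z=1.96):
    """95% CI for an observed proportion (normal approximation; assumed method)."""
    p = deaths / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def differs_from_prediction(deaths, n, predicted_mortality):
    """True when the predicted mortality lies outside the CI of the observed one."""
    low, high = wald_ci(deaths, n)
    return not (low <= predicted_mortality <= high)


# Hypothetical subgroup: 13 deaths in 29 patients, predicted mortality 34.9%.
low, high = wald_ci(13, 29)
flagged = differs_from_prediction(13, 29, 0.349)
```

With small subgroups the interval is wide, so a sizeable gap between observed and predicted mortality may still fall inside it.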

Results

The study included 748 patients from six countries. Their mean age was 54.9±18.6 years, 63.5% were men, and ICU mortality was 21.5%. Patient characteristics in the various countries are summarized in Table 1. Less than 1% of the data were missing. Differences among countries were observed in the age of the patients and in admission diagnosis. The differences in mean admission SOFA and mean SOFA Max scores among countries are shown in Table 1. Differences in ICU mortality did not reach statistical significance.

Table 1 Patient characteristics in countries A–F

Differences between survivors and nonsurvivors were observed. Survivors were younger (age 53.3±18.9 vs. 60.7±17.0 years, p<0.0001), had a lower infection rate (27% vs. 55%, p<0.0001), lower admission SOFA (4.0±3.0 vs. 7.3±4.0, p<0.0001), lower SOFA Max (6.7±4.4 vs. 13.5±5.0, p<0.0001), lower Δ Max (2.8±2.9 vs. 6.3±3.8, p<0.0001), and lower Δ 48 h (0.0±2.4 vs. 1.3±3.4, p<0.0001). Medical admissions were more common in nonsurvivors (56 vs. 44%, p<0.01); trauma and coronary admissions were more common among survivors (14 vs. 6%, p<0.01 and 7 vs. 2%, p<0.01, respectively). After multivariate logistic regression analysis with ICU mortality as the dependent outcome the only variables that remained significantly associated with death were SOFA Max, infection, and age (Table 2). The Δ between admission SOFA and SOFA Max was not included in the model because of redundancy.

Table 2 Multivariate logistic regression on SOFA Max-infection model with significant variables

The logistic regression model based on SOFA Max showed good calibration, as assessed by the C and H statistics (p=0.54 and p=0.95, respectively). However, when the model was assessed by the Pulkstenis-Robinson method for the presence of infection, the individual D statistics showed large values for both the infected (from −6.788 to +7.894) and noninfected (from −6.565 to +7.464) patients. This demonstrated an obvious underestimation of mortality for infected patients and an overestimation for noninfected patients, and highlighted the need to include infection as a covariate in the model, although the p value was not significant (p=0.157). After inclusion of infection these discrepancies were no longer observed (values ranged from −1.782 to +1.850), and the p value for the D statistic was 0.796. The AUC was 0.840 (95% CI, 0.804–0.872) for the SOFA Max model and 0.845 (95% CI, 0.809–0.876) after the addition of infection as a covariate.
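For readers unfamiliar with the Hosmer-Lemeshow C statistic, a minimal sketch of its computation is given here. It assumes the usual formulation, in which patients are sorted by predicted probability, split into groups of nearly equal size, and the squared difference between observed and expected deaths is summed over groups. The function name and example data are hypothetical, and the p value, which would be obtained by comparing the statistic with a χ2 distribution with (groups − 2) degrees of freedom, is not computed.

```python
def hosmer_lemeshow_c(probs, outcomes, groups=10):
    """Hosmer-Lemeshow C statistic over equal-size groups of predicted risk.

    probs: predicted probabilities of death; outcomes: 1 = died, 0 = survived.
    """
    # Sort by predicted probability only, keeping tied patients in input order.
    pairs = sorted(zip(probs, outcomes), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        mean_p = expected / size
        variance = size * mean_p * (1 - mean_p)
        if variance > 0:
            statistic += (observed - expected) ** 2 / variance
    return statistic


# Hypothetical, badly calibrated predictions split into two groups.
c_value = hosmer_lemeshow_c([0.2, 0.2, 0.8, 0.8], [1, 1, 0, 0], groups=2)
```

Larger values indicate a greater discrepancy between observed and expected deaths, which corresponds to a smaller p value.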

Calibration for the model based on SOFA Max, infection, and age (SOFA Max-infection model) showed p values for the H and the C statistics of 0.72 and 0.37, respectively; the calibration curve is shown in Fig. 1. The D statistic showed a p value of 0.825. Discrimination was assessed by the AUC (0.853, 95% CI: 0.817–0.884). Subsequent analysis was performed with this SOFA Max-infection model.
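The AUC can be read as the probability that a randomly chosen nonsurvivor received a higher predicted risk than a randomly chosen survivor. The sketch below computes it by direct pairwise comparison, counting ties as one half; it is an illustration of the quantity, not the software used in the study, and the names and example data are hypothetical.

```python
def auc(probs, outcomes):
    """Area under the ROC curve by pairwise comparison (ties count 0.5).

    probs: predicted probabilities of death; outcomes: 1 = died, 0 = survived.
    """
    died = [p for p, y in zip(probs, outcomes) if y == 1]
    survived = [p for p, y in zip(probs, outcomes) if y == 0]
    if not died or not survived:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for p_died in died:
        for p_survived in survived:
            if p_died > p_survived:
                concordant += 1.0
            elif p_died == p_survived:
                concordant += 0.5
    return concordant / (len(died) * len(survived))


# Hypothetical predictions for two survivors and two nonsurvivors.
example_auc = auc([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1])
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation of survivors and nonsurvivors.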

Fig. 1 Predicted vs. observed ICU mortality. Continuous line: observed vs. predicted; dashed line: line of identity

The observed ICU mortality for each country was compared to the predicted ICU mortality using the SOFA Max-infection model. There were no statistically significant differences in observed and predicted ICU mortalities, except for one country, which had a higher than expected mortality for the whole population (28.3 vs. 19.1%, p<0.05) and for the noninfected patients (21.4 vs. 12.6%, p<0.05). The infected patients in this country had a higher than predicted ICU mortality but this was not statistically significant (44.8 vs. 34.9%; 95% CI for observed: 26.7–62.9%). Analyzing the different prediction levels, ICU mortality was higher in every decile in this country (Fig. 2). Calibration was not performed for individual countries because of the limited number of patients.

Fig. 2 Predicted vs. observed ICU mortality in the one country with higher observed than predicted mortality. Continuous line: observed vs. predicted; dashed line: line of identity

Discussion

Outcome prediction is a fundamental tool in critical care. The available severity scores such as the APACHE II and the SAPS predict mortality based on physiological variables collected in the first 24 h of ICU stay, ignoring the fact that morbidity and mortality are very closely correlated, and that changes in the initial parameters may influence patient outcome; indeed, using the MPM model, Rué et al. [7] observed that the best estimate of hospital mortality was the probability of death on the current day. In an ICU environment morbidity can be described as multiple organ dysfunction syndrome, which has been observed with several acute states commonly seen in the ICU, including hemorrhagic shock [26], infection [27, 28], acute pancreatitis [29], burns [30], shock [31], and trauma [32]. The SOFA score, based on six independent and simple to obtain variables, was initially presented for assessing morbidity in septic patients [16] but has also been validated in trauma [33] and in general ICU patients [21]. Recently it has been shown that the admission SOFA score, SOFA Max, and the changes in SOFA over the first 48 h are correlated with mortality [19, 20].

Consistent with the idea that mortality is related to severity of organ failure, our results show that although the admission SOFA score is related to ICU mortality, in a multivariate analysis only the SOFA Max is significantly related to ICU mortality. The SOFA Max model showed good calibration by the Hosmer-Lemeshow statistic, but, as observed by Pulkstenis and Robinson, this method may not detect poor calibration when an important binary covariate is missing [23]. Indeed, after stratification for the presence of all infections combined, we observed marked disparities between the observed and predicted mortalities for infected and noninfected patients. After infection was added as a covariate (SOFA Max-infection model), performance improved. Thus, when the SOFA score is used to evaluate the severity of the disease process, it should be adjusted for the presence of infection. Other studies have compared patients with the systemic inflammatory response syndrome with or without infection, noting higher mortality rates in infected patients [34]. Our data are even more noteworthy since the SOFA score represents a more complex analysis of organ function. The observation that infection influences the probability of death, independently of the degree of organ failure measured by the available clinical scores, leads us to believe that the presence of infection should be included in these scores. In addition, if we want a more precise score, we should measure not only the degree of organ damage but also the inflammatory and coagulation abnormalities present in sepsis.

The calibration curve for the SOFA Max-infection model showed that the observed ICU mortality was always close to the line of identity, a finding that differs from what has been reported in the literature, in which overestimation or underestimation of mortality in sicker patients is usually observed [15, 35, 36]. This may be due to the fact that as the SOFA Max changes with the clinical course of the patient, greater accuracy can be obtained in all deciles of prediction.

Discrimination of the SOFA Max-infection model was good (AUC: 0.853) and similar to published values for the APACHE III [37, 38] and logistic organ dysfunction score [39] after customization, despite the fact that the SOFA score is simpler to compute and does not depend on admission diagnosis. This facet is highlighted by the observation that although medical, coronary, and trauma admissions had a significant univariate relationship with ICU mortality, after adjustment for the SOFA score their importance was lost, a finding that makes sense in view of the fact that the SOFA score actually measures organ dysfunction, and this may be a common pathway for many different disease states.

International differences were assessed, looking for variations in predicted vs. observed mortality. Since ICU selection was not random, these data should be interpreted carefully, as they may not be representative of each country. In this analysis, only one country had a higher ICU mortality than predicted, both in the overall population and in the noninfected patients, and this finding was present in all deciles of prediction. This discrepancy may be due to differences in case-mix, to nonmeasured clinical and nonclinical [40] variables that may be strongly related to mortality [35, 41, 42], to differences in quality of care, or to differences in cultural aspects such as resource allocation and policies regarding the limiting of therapy. Our data cannot solve this difficult question, although differences in case-mix should not be important after customization [43].

Our study differs from others as the models were applied to subpopulations of the original development data. This may impair external validation of the SOFA Max-infection model, but comparisons within groups should be enhanced, as opposed to the poor calibration found when other models were applied to independent populations [15, 44, 45]. It is also important to indicate that the present study included a limited number of patients from each country, thus preventing a more precise analysis, as simulations have shown that for smaller populations the fit of the model may deteriorate [12], and as yet we do not have adequate statistical methods to calibrate models in limited samples. Another potential limitation is that we convert a continuous variable (probability of death) into a dichotomous variable (dead or alive), but this is an inherent problem in studies of this type. In addition, while there was an apparent higher ICU mortality in patients from one country, the data do not allow any conclusions to be made regarding the reasons behind this finding; in particular, we cannot say from these results that the standard of care in that country is any less than in any of the five other countries. Rather, our findings should encourage other studies into international differences in ICU mortality rates, both in terms of absolute numbers and cause. Importantly, too, this model should not be used to evaluate the risk of death in individual patients.

In conclusion, the SOFA Max, adjusted for age and infection, is significantly related to ICU outcome, independently of admission category, but nonmeasured variables may still play an important role in differences found between predicted and observed ICU mortality.