Introduction

In the 1980s the first general mortality prediction models for pediatric intensive care were developed to standardize mortality between pediatric intensive care units (PICUs) and to make outcome assessment between PICUs more objective [1, 2]. Nowadays mortality prediction models are well established in the PICU setting and are used in many studies and registries all over the world [3]. Mortality prediction models are essential to adjust for case-mix and severity of illness, and to describe the characteristics of diverse subgroup populations [4–7].

Although mortality rates in pediatric intensive care have roughly halved in more economically developed countries [1, 2, 8–11], mortality remains an important outcome for benchmarking PICUs [7, 12, 13].

Two important families of general mortality prediction models are available for PICUs: the paediatric index of mortality (PIM) and the pediatric risk of mortality (PRISM). Both models have since been updated to PIM2 and PRISM3, respectively [2, 9–11]. The PIM models use data collected from first contact with a pediatric intensivist up to the first hour of PICU admission. The PRISM is available as a severity of illness score (denoted here as PRISMscore and PRISM3score) and as a predicted risk of mortality based on the most extreme values recorded in the first 12 h (PRISM3-12) or the first 24 h in the PICU (PRISM and PRISM3-24). PIM2 has recently been recalibrated in Australian and New Zealand PICUs into the PIM2-ANZ06 and PIM2-ANZ08 for local benchmarking [13, 14].

Although the PIM and PRISM models have been externally validated in different settings, they have not been compared simultaneously on specific subgroup populations fully different in time and place from the development setting [15–17]. The aim of this study was to determine the predictive performance of (variants of) the PIM and PRISM models in direct comparison to each other, within the overall population as well as in specific subgroups.

Methods

Data and setting

We analyzed data from consecutive patients admitted to the eight Dutch PICUs from February 2006 up to October 2007. After this first period, six centers voluntarily continued the study for different periods up to October 2009. Each center collected data for all consecutive admissions, up to the last day of the month ending its inclusion period, on all models: PIM, PIM2, PIM2-ANZ06, PIM2-ANZ08, PRISM, and PRISM3-24. Data collection for this study was performed as part of the pediatric intensive care evaluation (PICE), a national PICU registry in which routinely measured values on admissions to the PICUs are collected for national benchmarking and for the maintenance of a clinical database on patient characteristics and outcome [7]. Data quality was assessed following standardized procedures, including local audit site visits with re-extraction of data for approximately 40 randomly chosen admissions at each site.

For direct comparison, patients were selected if they fulfilled the study criteria of both the PIM and PRISM models [2, 9–11]. We therefore excluded patients who were 16 years of age or older, those transferred to another ICU, and those who died within 2 h of PICU admission or did not achieve stable vital signs for at least 2 h within the first 24 h. Details on the data collection process and data quality are presented in the Electronic Supplementary Material (ESM), Tables 4 and 5. Mortality in the PICU was the end point for statistical analysis. Missing values were treated as normal and hence contributed no additional risk to the models. The mortality risk for the proprietary PRISM3-24 was provided by Virtual PICU Systems (VPS LLC, Los Angeles, CA, USA) (https://portal.myvps.org).

External validation

The external validity of the PIM and PRISM models was compared directly on the same patients by calibration and discrimination in the total study population as well as in subpopulations [18, 19].

Calibration

The overall prediction, or calibration-in-the-large, was analyzed by the ratio between observed and expected mortality, expressed as a standardized mortality ratio (SMR) with a 95 % confidence interval based on a Poisson distribution [20]. Overall prediction of observed mortality was considered accurate when the SMR was not significantly below or above 1. The calibration of the models was further assessed graphically. We plotted observed outcomes by deciles of predicted risk, and compared observed to predicted mortality with a nonparametric curve. When the plotted curve is a straight diagonal line (slope 1, intercept 0), predicted mortality matches observed mortality. The larger the deviation from this ideal diagonal line, the less accurate the calibration [18, 21].
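As an illustration, the following is a minimal R sketch of these calibration analyses, assuming a data frame d with an observed outcome died (0/1) and a model's predicted risk p_model (hypothetical names); the exact Poisson confidence interval is obtained through the standard chi-squared relation.

```r
## Calibration-in-the-large: SMR with an exact Poisson 95 % CI.
## `d`, `died`, and `p_model` are hypothetical names for illustration.
observed <- sum(d$died)          # observed number of deaths
expected <- sum(d$p_model)       # expected deaths = sum of predicted risks
smr      <- observed / expected  # standardized mortality ratio

# Exact Poisson 95 % CI for the observed count, via the chi-squared relation
smr_lo <- qchisq(0.025, 2 * observed) / 2 / expected
smr_hi <- qchisq(0.975, 2 * (observed + 1)) / 2 / expected

## Graphical calibration: observed vs. mean predicted mortality by risk decile
breaks    <- unique(quantile(d$p_model, 0:10 / 10))  # unique() guards against ties
d$decile  <- cut(d$p_model, breaks, include.lowest = TRUE)
obs_pred  <- aggregate(cbind(died, p_model) ~ decile, data = d, FUN = mean)
plot(obs_pred$p_model, obs_pred$died,
     xlab = "Predicted mortality", ylab = "Observed mortality")
abline(0, 1, lty = 2)  # ideal diagonal: slope 1, intercept 0
```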

Discrimination

The discriminative ability of each model was analyzed by calculating the area under the receiver operating characteristic (ROC) curve (AUC) [22]. With an AUC of 0.5 a model discriminates no better than chance, and with an AUC of 1.0 it discriminates perfectly. Statistically significant differences in AUC were determined by drawing 5,000 bootstrap samples in each (sub)group and comparing the AUCs of two models at a time within each bootstrap sample [18].
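A minimal R sketch of this paired bootstrap comparison, again assuming a data frame d with outcome died and the predicted risks of two models, p_a and p_b (hypothetical names); the AUC is computed here from the Mann-Whitney rank statistic.

```r
## Rank-based (Mann-Whitney) AUC; midranks handle ties in the predicted risks
auc <- function(outcome, risk) {
  r  <- rank(risk)
  n1 <- sum(outcome == 1)
  n0 <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

## Paired bootstrap: resample the same patients, compare both models each time
set.seed(1)
diffs <- replicate(5000, {
  i <- sample(nrow(d), replace = TRUE)
  auc(d$died[i], d$p_a[i]) - auc(d$died[i], d$p_b[i])
})
quantile(diffs, c(0.025, 0.975))  # difference significant if the CI excludes 0
```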

Subgroup analysis

We analyzed subgroups of patients with respect to diagnosis, age and urgency at admission to the PICU, and according to length of stay (LoS) in the PICU. Subgroups with mutually exclusive categories were chosen for their relevance to clinical and validation studies, and also considering their sample size [9, 11, 15, 17, 23, 24]. LoS was categorized in line with its distribution, with cutoffs at the median (3 days) and the mean (6 days).

To compare the mortality prediction of the models on their SMR more directly within subgroups, the models were customized to the total study population by a logistic recalibration procedure. A logistic regression model was refitted with the linear predictor of a prediction model as the single covariable. This procedure was followed for each of the prediction models considered. This ensured that predicted mortality equaled observed mortality in the total population (SMR = 1), and that the average predictive effects were calibrated to the PICE setting (regression slope = 1). The discriminative ability remained the same, since the relative weights of the risk factors in the models were not altered [18, 21].
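A minimal R sketch of this logistic recalibration, under the same hypothetical names (data frame d with outcome died and a model's predicted risk p_model):

```r
## Logistic recalibration: refit the outcome on the model's linear predictor.
eps <- 1e-6                                         # guard against risks of exactly 0 or 1
lp  <- qlogis(pmin(pmax(d$p_model, eps), 1 - eps))  # linear predictor (logit scale)

fit <- glm(d$died ~ lp, family = binomial)  # intercept and slope re-estimated
d$p_recal <- fitted(fit)                    # recalibrated predicted risks

## By construction sum(d$p_recal) equals sum(d$died), giving SMR = 1 in the
## total population; the ranking of patients, and hence the AUC, is unchanged,
## as the logit transform and the refitted model are monotone in p_model.
```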

Statistical analyses were conducted using SPSS PASW version 17.0.2 (released 11-03-2009; SPSS Inc., Chicago, IL, USA) and R version 2.12.1 (released 16-12-2010; The R Foundation for Statistical Computing, Vienna, Austria). Statistical significance was defined as p < 0.05.

Results

Data

In the total study period, 13,642 consecutive admissions were registered, with 484 deaths. The combined criteria of the PIM and PRISM models led to the exclusion of 946 admissions (51 deaths): 564 patients were 16 years of age or older on admission, 352 were transferred to another ICU, 14 died within 2 h of PICU admission, and 16 did not achieve stable vital signs for at least 2 h within the first 24 h. Beyond these exclusion criteria, one center (601 admissions, 19 deaths) had to be excluded because its data quality was insufficient. A further 55 admissions (two deaths) from six different centers were excluded from analysis because of poor data quality. Eventually, 12,040 admissions with 412 deaths were analyzed, originating from seven Dutch centers with admissions from February 2006 up to October 2009 (Fig. 1 and ESM Tables 2, 4).

Fig. 1 Flowchart of exclusions from the study population

Overall performance

Calibration

The PIM2 and PIM2-ANZ06 showed good overall prediction of mortality, with an SMR not significantly below or above 1 (Table 1). PIM, PRISM, and PRISM3-24 predicted significantly higher mortality than observed (SMR < 1), whereas the most recent regionally updated model, PIM2-ANZ08, predicted significantly lower mortality (SMR > 1). Most patients had a low risk of mortality: in all models, 80 % had a predicted risk below 10 %. Mortality in the highest risk deciles was lower than predicted by all models. The PIM2-ANZ06 was best calibrated overall, with a slope of 1.00 and an intercept of 0.06. The PIM2-ANZ08 did not calibrate as well as the PIM2-ANZ06, especially in patients with a predicted risk below 0.2 or between 0.4 and 0.6 (Fig. 2).

Table 1 Overall performance of the original models by calibration (SMR) and discrimination (AUC)

Fig. 2 Calibration plots of the original models

Discrimination

All mortality prediction models had good discrimination, with the PRISM3-24 (AUC = 0.90) performing significantly better than all other models (Table 1). Discrimination across the models ranged from PIM (AUC = 0.83), through PIM2 (AUC = 0.85), PIM2-ANZ06 (AUC = 0.86), PIM2-ANZ08 (AUC = 0.85), and PRISM (AUC = 0.88), to PRISM3-24 (AUC = 0.90).

Subgroup performance

For assessment of subgroup performance we focused on the two most promising models, PIM2-ANZ06 and PRISM3-24, as they performed best on calibration and discrimination. After recalibration to the average mortality in the overall study population, the PRISM3-24 had a slightly better calibration plot than the PIM2-ANZ06 (Fig. 3). Further analysis within the subgroups revealed better calibration for the PIM2-ANZ06 only within the urgency subgroup (Fig. 4). The ranking of the models on discrimination was largely the same as in the overall population. The PRISM3-24 had a higher AUC than the PIM2-ANZ06 in all categories, and this difference was statistically significant in half of them (Fig. 5). Details on calibration and discrimination for all models within the subgroups are shown in ESM Table 3.

Fig. 3 Calibration plots of the recalibrated (‘customized’) models

Fig. 4 Calibration (SMR) of the recalibrated PIM2-ANZ06 and PRISM3-24 within subgroups

Fig. 5 Discrimination (AUC) of PIM2-ANZ06 and PRISM3-24 within subgroups

Performance of recalibrated models on diagnoses, age and urgency

In cardiac patients, both surgical and nonsurgical, mortality was higher (SMR > 1) than predicted by the PIM2-ANZ06. Actual mortality in neurological nonsurgical patients was higher (SMR > 1) than predicted by any model, and in surgical noncardiac patients mortality was lower (SMR < 1) than predicted (Fig. 4). Both PIM2-ANZ06 and PRISM3-24 predicted mortality accurately in all age groups, but the PIM2-ANZ06, like all other PIM models, showed a tendency to predict lower mortality than observed in the younger population. In elective admissions, mortality was lower than predicted by the PRISM3-24 (SMR < 1) (Fig. 4 and ESM Table 3).

Predictive performance according to LoS in PICU

The longer the stay in the PICU, the poorer the models performed. All models were well calibrated for patients discharged within the first 6 days, but with a longer stay accurate prediction became more difficult. Observed mortality was significantly higher than predicted (SMR > 1) by the PIM2-ANZ06 in patients staying longer than 6 days in the PICU. The discriminative ability of all models was very good (AUC ≥ 0.92) for a short LoS (<3 days), with the PRISM models showing significantly higher AUCs than the PIM models. Discrimination of all models declined with a longer stay in the PICU; only the PRISM3-24 retained an AUC above 0.7 (AUC = 0.73) for a stay of more than 6 days (Fig. 5 and ESM Table 3).

Discussion

The PIM and PRISM models performed well in a national PICU registry, both in their overall and in their subgroup-specific predictive capacity. The PIM2(-ANZ06) was best calibrated to the overall population. The PRISM3-24 had significantly better discrimination than all other models and, after recalibration to our study population, was best calibrated across all subgroups.

PIM and PRISM are general mortality prediction models that are not aimed at one specific subgroup of pediatric intensive care. Good to excellent performance was found in different subgroups in the original validation settings [9, 11]. Some researchers have questioned the differences in performance of some of these models in specific subgroups [15, 17, 25–29]. As in our study, the main problem was found in calibration, not in discrimination. Within the overall study population, general over- or under-prediction was easily overcome by logistic recalibration, but this did not solve miscalibration within all subgroups (Table 1 and Fig. 4). In only three of the seven categories of the diagnoses subgroup (injuries, respiratory, and miscellaneous diagnoses) did all recalibrated models predict mortality correctly (ESM Table 3). The lower than predicted mortality in surgical noncardiac patients and the higher than predicted mortality in neurological nonsurgical patients are in line with other studies [15, 17]. All models had poorer discrimination in younger patients, although it was still quite reasonable (AUC ≥ 0.79) for patients <28 days of age (neonates), and even for neonates <7 days of age (ESM Table 3) [15, 30]. Only the PIM2-ANZ06 was well calibrated for both elective and nonelective admissions; urgency is a risk factor in the PIM models. The PRISM3-24 had better discrimination, but predicted relatively too high a mortality in elective admissions. This suggests that severity of illness in elective admissions may need to be adjusted with an additional risk factor for urgency.

More extensive updating, with re-estimation of the specific predictors within the models, may be required to overcome the problem of subgroup-specific calibration. This was done for Australia and New Zealand in the regionally updated PIM2-ANZ06 and PIM2-ANZ08 [13, 14, 18]. However, such model revision also affects the discrimination of the models, and hence their external validity [18, 31]. In our study, we found that the most recently updated version of PIM2, the PIM2-ANZ08 (Table 1 and Fig. 3), was less well calibrated than the PIM2-ANZ06 in all subgroups and discriminated less well in some categories (ESM Table 3). Nevertheless, both regionally updated PIM2 versions showed external validity in our study population. A recent study into the external applicability of other re-estimated PIM2 versions concluded that a locally updated version could also be applied outside the region of updating [32].

We included the original PIM and PRISM models in our analysis, as they are still used in different studies [5, 6]. It has been advised not to use the PRISM any longer, but we found that, after logistic recalibration, it still performed remarkably well overall, with the exception of the (young) age groups and respiratory diagnoses (ESM Table 3) [33, 34]. The PRISM3-12 was not included because its data collection burden was too high, but a previous large validation study in the UK found both versions of PRISM3 to have quite similar overall performance [16]. Besides the PRISM mortality prediction models, we analyzed the PRISM severity of illness scores and found that they discriminated mortality quite well, also in subgroups, but their calibration was not as good as that of their risk prediction counterparts.

We divided outcome according to LoS in the PICU and showed that discrimination decreased sharply for all models with a longer stay. Only the PRISM3-24 was able to discriminate reasonably (AUC = 0.73) when the PICU stay lasted more than 6 days. Factors other than the risk factors from the first day of admission might influence later mortality in the PICU [35]. Future model updates will need to address the prediction of outcome in these prolonged admissions.

A strength of our study is that we externally validated all mortality prediction models simultaneously within a large cohort from the Dutch PICUs [19]. Staff were trained in the guidelines for the different models before data collection started, and the rules were strictly followed. Nevertheless, a large cohort study has some limitations. Missing values are inevitable in observational studies, especially when measurements have to be performed within the first hour of contact with the patient. The differences between models in the number of available values and values out of range (Fig. 1 and ESM Table 4) can be explained by their number of variables and collection periods, a topic on which some debate has taken place [26, 36].

The downside of directly comparing models with different exclusion criteria was the exclusion of part of the total PICU population that could otherwise have been included in one of the models (Fig. 1). Nevertheless, the different exclusion policies had hardly any effect on the ranking of the models in direct comparison. Details on data quality and on model performance under different population selections are available in the Electronic Supplementary Material (ESM Table 5).

Another limitation of the study is the sample size. A sample with around 100 events (deaths) and 100 nonevents (survivors) is a minimum for sound validation studies [24, 37]. So far, only a few external validation studies in pediatric intensive care have been performed with such a sufficiently large number of events [15–17, 29]. Even though our study cohort was rather large (12,040 admissions), some clinically interesting diagnostic groups, such as patients with central nervous system infections or sepsis, could not be analyzed separately because of the small number of deaths. Such detailed studies will remain difficult for individual centers and smaller countries to realize; collaboration through international PICU registries would be helpful here.

Conclusion

We consider the freely available PIM2(-ANZ06) and the proprietary PRISM3-24 the best prediction models for Western European PICU registries, because of their good overall calibration and discrimination and their good performance in previous European validation studies. Both models discriminated well overall and in almost all subgroup categories, including patients <28 days of age, and the recalibrated PRISM3-24 was the most stable predictor across subgroups. None of the models tested here can be recommended for mortality prediction in patients staying longer than 6 days in the PICU. Because of differing performance in subgroups, models should be applied with caution for risk adjustment in subpopulations.