Introduction

Mortality prediction models are an important component of the evaluation of the overall quality of care provided by pediatric intensive care units (PICUs) [1]. These models estimate the risk of death for each individual patient from physiologic data collected at admission, and the individual predictions are aggregated to obtain expected mortality rates for PICUs or groups of patients. Discordance between observed mortality and the expected mortality predicted by the models is taken to indicate better or worse than average quality of care [2].

However, a disagreement between observed and predicted mortality rates can have causes other than the quality of care. When a predictive model is developed, the prediction rule is optimized for the outcome in the study population at the time of the study. The model may therefore under- or overestimate mortality when applied to other populations or many years after its development [3].

Since January 2012, the Swiss Society of Intensive Care Medicine has required the use of the Pediatric Index of Mortality 2 (PIM2) score to predict the risk of death for all pediatric admissions to all PICUs in Switzerland. This score was developed more than 20 years ago in Australia, New Zealand and the UK [4]. Using the PIM2 score therefore means that the present quality of care is assessed against the quality of care of 20 years ago. In addition, the performance of this model has never been assessed in Switzerland.

In the last few years, a drift in the calibration of the PIM2 score (calibration being the degree of agreement between predicted and observed deaths) has been detected in two large national registries in Australia, New Zealand and the UK [5, 6]. The use of PIM2 might therefore result in an overestimation of the quality of care in a country.

The aim of this study was to evaluate the predictive performance of the PIM2 score by analyzing its discrimination and calibration in the whole cohort and in pre-specified subgroups of patients admitted to PICUs in Switzerland.

Materials and methods

This retrospective study analyzed data from the Minimal Intensive Care Unit Dataset (MDSi) of the Swiss Society of Intensive Care Medicine, which systematically collects information on all pediatric admissions to all PICUs in Switzerland.

Patients

The study population comprised all children younger than 16 years admitted to any PICU in Switzerland between January 1, 2012 and December 31, 2017. Premature infants of less than 36 weeks of gestational age and patients with an admission diagnosis of “organ donor” were excluded from the analysis. Children who could not be assigned a single primary admission diagnosis were also excluded.

A new health record number is systematically assigned to patients who are admitted to another unit. As a consequence, mortality could not be established for patients transferred to or coming from other PICUs (2078 cases, 8%), and the calibration and discriminative power of the score could not be assessed for them. These patients were therefore excluded from the study.

Data

All 8 tertiary PICUs in Switzerland contribute data to the MDSi using standardized admission coding. Since 2012, the pediatric MDSi has used the ANZPIC Registry diagnostic codes to classify children into the following primary admission categories: cardiac (including postoperative), cardiac or respiratory arrest, trauma, neurology, oncology, respiratory, sepsis with or without septic shock, miscellaneous, general medical, and general surgical [7].

Statistical analysis

Data are summarized as frequencies and percentages for categorical variables and as mean (standard deviation) and range for continuous variables. Characteristics were compared between survivors and non-survivors using t tests and chi-square tests. Patients were divided into age groups (newborn, infant, pre-school, school, and adolescent). The performance of the PIM2 score was assessed by analyzing its discrimination and calibration in the general population and in different subgroups (age, year of admission, diagnosis at admission, and operative status). Calibration was assessed by the standardized mortality ratio (SMR, the ratio between observed and predicted deaths) with 95% CIs and by calibration plots. In addition, the Hosmer-Lemeshow (HL) goodness-of-fit test was applied to compare the mortality observed with that predicted by the PIM2 score when patients were categorized according to deciles of predicted risk [8]. Discrimination, which is the extent to which a model assigns a higher probability of an event to patients who will have the event than to those who will not, was assessed through the area under the receiver operating characteristic curve (AUC-ROC) [9]. All statistical tests were two-sided, and the significance level was 0.05. All analyses were performed using R statistical software version 3.4.3 (The R Foundation for Statistical Computing).
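As an illustration of the two performance measures defined above, the following Python sketch computes an AUC-ROC and an HL statistic on simulated data. It is not the analysis code of this study (which used R); the function names, the simulated risk distribution, and the doubling of the odds of death are assumptions made only for the example. It shows that a score whose predictions are systematically inflated keeps the same discrimination while its calibration deteriorates.

```python
import random


def auc_roc(risks, died):
    """AUC as the probability that a randomly chosen non-survivor has a
    higher predicted risk than a randomly chosen survivor (ties count 0.5)."""
    pos = [r for r, d in zip(risks, died) if d]
    neg = [r for r, d in zip(risks, died) if not d]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(risks, died, groups=10):
    """HL chi-square statistic over groups of increasing predicted risk.
    The reference distribution (degrees of freedom) depends on whether the
    model was fitted on the same data, so no p-value is computed here."""
    ranked = sorted(zip(risks, died))
    n = len(ranked)
    stat = 0.0
    for g in range(groups):
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(d for _, d in chunk)   # deaths observed in the group
        expected = sum(r for r, _ in chunk)   # deaths predicted in the group
        stat += (observed - expected) ** 2 / expected
        stat += (observed - expected) ** 2 / (size - expected)
    return stat


# Simulated cohort (hypothetical): each patient dies with their true risk.
random.seed(0)
true_risk = [random.betavariate(1, 30) for _ in range(5000)]
died = [random.random() < p for p in true_risk]

# A hypothetical outdated score that doubles the odds of death.
# The transformation is monotone, so the ranking of patients is unchanged.
inflated_risk = [2 * p / (1 + p) for p in true_risk]

auc_true = auc_roc(true_risk, died)
auc_inflated = auc_roc(inflated_risk, died)
hl_true = hosmer_lemeshow(true_risk, died)
hl_inflated = hosmer_lemeshow(inflated_risk, died)

print(f"AUC: true risk {auc_true:.3f}, inflated score {auc_inflated:.3f}")
print(f"HL statistic: true risk {hl_true:.1f}, inflated score {hl_inflated:.1f}")
```

Because the AUC depends only on the ranking of patients, it cannot detect this kind of drift; the HL statistic and the SMR are the measures that respond to it.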

Results

During the study period, 25,041 admissions were registered; 2659 (10.6%) were excluded from the analysis because of exclusion criteria (Fig. 1). A total of 22,382 patients were analyzed. Demographic characteristics for these patients are summarized in Table 1. The main diagnostic categories were “miscellaneous” (28.5%) and respiratory indications (27.4%). The most frequent diagnoses among “miscellaneous” patients were gastrointestinal/bowel obstruction (8%), infection (5%), need for PICU surveillance after invasive procedures (4%), and decompensated diabetes (3%).

Fig. 1 Flow-chart of the study population

Table 1 Characteristics of the study population according to the vital status

PIM2 performance in entire cohort

The overall observed mortality was 2% (445/22,382), whereas the mortality predicted by PIM2 (4.2%, 949 deaths) was about twice as high (SMR = 0.47, 95% CI, 0.42–0.52). To describe this miscalibration in more detail, patients were categorized according to the risk predicted by the PIM2 score. Except in patients with a predicted risk lower than 0.2%, the observed mortality was systematically lower than the mortality predicted by the PIM2 score (Table 2). The difference between the number of deaths observed and the number predicted by the PIM2 score was statistically significant (p < 0.001). The calibration plot (Fig. 2) shows graphically the overestimation of mortality by the PIM2 score. The AUC-ROC for the entire cohort was 0.88 (95% CI, 0.87–0.90).
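The SMR reported above can be recomputed from the observed and predicted numbers of deaths. The Python sketch below does so and adds an approximate 95% CI under an assumption of ours: the observed deaths are treated as a Poisson count and the interval is built on the log scale. The method used for the published interval is not stated in this article, so the bounds obtained here should be close to, but not necessarily identical with, the reported ones.

```python
import math


def smr_with_ci(observed, expected, z=1.959964):
    """SMR = observed / expected deaths, with an approximate 95% CI.

    Assumption (not taken from the article): observed deaths follow a
    Poisson distribution, so the standard error of log(SMR) is
    1 / sqrt(observed).
    """
    smr = observed / expected
    se_log = 1.0 / math.sqrt(observed)
    lower = smr * math.exp(-z * se_log)
    upper = smr * math.exp(z * se_log)
    return smr, lower, upper


# Figures reported for the entire cohort: 445 observed, 949 predicted deaths.
smr, lower, upper = smr_with_ci(445, 949)
print(f"SMR = {smr:.2f} (approximate 95% CI {lower:.2f} to {upper:.2f})")
```

An SMR below 1 with an interval that excludes 1 means that significantly fewer deaths occurred than the score predicted.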

Table 2 Mortality observed and predicted by the PIM2 score per category of predicted risk. For a perfectly calibrated predictive score, the predicted numbers of deaths are equal to the observed numbers of deaths
Fig. 2 Receiver operating characteristic curve (left), calibration plot (middle), and boxplot (right) of PIM2 scores by vital status for the entire patient population

Analysis by age group

The performance of the PIM2 score across age groups was similar to that of the entire cohort, with poor calibration and acceptable discrimination (Table 3). Neither the SMRs (p = 0.208) nor the AUC-ROCs (p = 0.235) were significantly different across age groups.

Table 3 Calibration and discrimination of the PIM2 score by age group

Analysis by diagnostic group

The SMR varied considerably across diagnostic groups, and the difference was statistically significant (p = 0.004, Table 4). The overestimation of mortality by the PIM2 score was especially high for respiratory disease and miscellaneous diagnoses: the predicted mortality was approximately 4 times higher than the observed mortality. In contrast, for sepsis and cardiorespiratory arrest, the predicted mortality was slightly lower than the observed mortality. The discrimination ability of the PIM2 score also varied across diagnostic groups (p < 0.001). The poorest discrimination was observed in patients admitted for oncologic diseases (AUC-ROC = 0.74, 95% CI 0.26 to 1.00) and the best in patients admitted for injury (AUC-ROC = 0.98, 95% CI 0.97 to 0.99).

Table 4 Calibration and discrimination of the PIM2 score by admission diagnostic groups and operative status

Discussion

Detailed prognostic information is extremely important for patients and physicians, as well as for quality-of-care assessment and benchmarking. Accurate estimation of the severity of illness might also contribute to the appropriate allocation of PICU resources and to admission triage. Predicted risks can be used to evaluate the outcome of one institution compared with others either directly, by comparing risk-adjusted outcomes between institutions, or indirectly, by comparing outcomes for the single institution against those predicted by the model [10]. A well-performing model also allows a PICU to compare its current performance with its past performance.

PIM2 was last recalibrated more than 10 years ago [1]. A reevaluation of the PIM2 score was therefore needed in Switzerland. We assessed the performance of the PIM2 score in the whole population of patients admitted to PICUs in Switzerland. Our results indicate that the score has an adequate capacity for discriminating between survivors and non-survivors in the general population as well as in age subgroups. Discrimination was excellent in the trauma and neurological subgroups, while the PIM2 score showed a poorer but still adequate discriminative power among septic and oncologic patients. Nonetheless, observed mortality was less than half of the predicted mortality in the whole study population (SMR = 0.47). An SMR of less than 1 provided by an outdated model such as PIM2 might be misleadingly interpreted as evidence that the quality of care is better than it actually is [11].

Calibration across diagnostic groups is also important because the diagnostic mix varies among units. If a model over- or under-predicts death in a specific group of patients, the performance of a PICU assessed by the model will be biased by the number of patients admitted in that diagnostic group [12]. In our study, poor calibration was observed across mortality risk categories and age groups, and mortality was overestimated in almost all diagnostic subgroups. This drift in calibration might be explained by improved patient care, changes in the mix of patients, and the timing of intensive care admission. However, in septic patients and patients admitted after a cardiorespiratory arrest, observed mortality exceeded expected mortality by 13% and 18%, respectively. Possible explanations include variations in thresholds for commencing and discontinuing life support in patients with particularly severe clinical conditions and the possibly inappropriate application of the model in the setting of a rapidly evolving critical illness. Knowing that the PIM2 score over- or under-predicts mortality according to primary diagnosis might help interpret the overall performance of PICUs still using the PIM2 score on the basis of the proportion of patients admitted in different diagnostic categories [4, 13].
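The effect of case mix on the overall SMR can be illustrated with a short Python sketch. The two group-specific SMRs are rounded from the findings discussed here (predicted mortality about 4 times the observed mortality in respiratory patients, observed mortality 13% above predicted in septic patients); the two units and their expected numbers of deaths are hypothetical and serve only to show the mechanism.

```python
# Group-specific SMRs, rounded from the results discussed in the text.
GROUP_SMR = {"respiratory": 0.25, "sepsis": 1.13}


def overall_smr(expected_deaths):
    """Overall SMR of a unit given its expected deaths per diagnostic group.

    Observed deaths are derived from the group-specific SMRs, so both
    hypothetical units deliver the same quality of care within each group.
    """
    observed = sum(GROUP_SMR[g] * e for g, e in expected_deaths.items())
    return observed / sum(expected_deaths.values())


# Two hypothetical units with identical care but opposite case mix.
unit_a = {"respiratory": 80.0, "sepsis": 20.0}
unit_b = {"respiratory": 20.0, "sepsis": 80.0}

print(f"Unit A (mostly respiratory): SMR = {overall_smr(unit_a):.3f}")
print(f"Unit B (mostly sepsis):      SMR = {overall_smr(unit_b):.3f}")
```

Although care is identical by construction, the unit admitting mostly respiratory patients appears to perform far better than the unit admitting mostly septic patients, which is why miscalibration that differs by diagnosis distorts comparisons between units.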

The PIM2 score under-predicts mortality among low-risk patients. Invasive procedures (e.g., arterial blood sampling) are not systematically performed in less severely ill patients. Therefore, variables such as base excess or PaO2 might have a high percentage of missing data that need to be imputed. Missing observations are usually given values considered normal, thus potentially undermining prediction in this category of patients.

Prognostic models are periodically recalibrated, sometimes by adding new variables, to compensate for the deterioration of calibration that occurs as a result of changes in medical practice and case mix over time [14]. A well-calibrated model allows PICUs to compare their performance with local as well as international standards of care. Given the less than satisfactory performance of the PIM2 score in Switzerland, the use of an updated prognostic score should probably be recommended.

A new version of the PIM score, PIM3, is available. It was released in 2013 in an effort to counterbalance the drift in calibration observed in two large national datasets [15]. This model, based on more recent data, provides better estimates of mortality risk among children admitted to PICUs in Australia, New Zealand, and the UK. Apart from the introduction of a quadratic transformation of systolic blood pressure (SBP), the reorganization of diagnostic categories with the introduction of the “very-high-risk diagnosis”, and a different classification of the term “recovery post procedure”, the variables included in the PIM3 model are substantially similar to those used in PIM2. Nonetheless, PIM3 seems to perform better than PIM2 and might represent a valid alternative to it [16]. The performance of the PIM3 score in Switzerland should be assessed in multicenter prospective studies before its systematic implementation.

Our study has strengths. The analysis included data from a large national database that contains extensive clinical information on all patients admitted to Swiss PICUs and is subject to ongoing audits of data quality and validation procedures. The fact that the entire population of children admitted to Swiss PICUs during the study period was analyzed increases the robustness of our results. Moreover, this is one of the first studies to evaluate the performance of PIM2 in uncommon but clinically relevant diagnostic subgroups such as septic patients and patients admitted after a cardiorespiratory arrest. Although a more recent version of the PIM score is available, we believe our results might be of interest for the many PICUs that still use PIM2 for mortality prediction [17, 18].

Our study also has limitations. One important limitation is its retrospective nature. Moreover, data were collected over 6 years (2012–2017), and mortality rates might have changed during the study period. Additionally, the “miscellaneous” group was the largest group of patients. Given its heterogeneity, the generalizability of our results for this category of patients may be problematic.

Conclusion

Our results indicate that, in a cohort of Swiss patients, the PIM2 score shows adequate discrimination but poor calibration. Calibration varies significantly according to primary diagnosis. The PIM2 score tends to under-predict the number of deaths among septic patients and patients admitted after a cardiorespiratory arrest. The PIM3 score might represent a potential alternative to PIM2.