Introduction

Prognostic models for groups of children admitted to paediatric intensive care units (PICUs) are now standard components of the methodology used in intensive care quality assurance and research [1–4]. Indeed, they are integrated in the databases of paediatric networks [5–7]. Two predictive scoring systems have been proposed for critically ill children: the pediatric risk of mortality (PRISM and PRISM III) and the paediatric index of mortality (PIM and PIM2) [8, 9]. PIM2 was initially validated in 20,787 critically ill children from 14 large PICUs in Australia, New Zealand and the UK, and was later validated after recalibration in the UK, where it demonstrated satisfactory performance [10]. This score has never, however, been studied in a prospective multicentre study in France.

Prognostic scoring systems that are developed in one country require recalibration (i.e. adapted coefficients) before being used to provide risk-adjusted outcomes of PICU mortality for units within a new health care setting [10]. Local recalibration has been reported to improve the performance of adult scoring systems in individual countries or regions [11–13]. Nevertheless, some authors have suggested that recalibration can compromise the ideal of comparisons between different geographic zones [13, 14]. However, the impact of recalibration between two European countries on the performance of the score, with a view to enabling such comparisons, has never been evaluated.

Structural and process factors might explain variations in outcomes [15, 16]. The effect of patient volume on the quality of paediatric intensive care has been a subject of debate in North America [17], as three out of four studies found that higher volume units had lower mortality [15, 18–20]. A better understanding of this relationship is needed to develop effective regionalization and referral policies for critically ill children [15, 18]. The performance of a unit for its case mix can be compared with the expected performance of a reference group of providers for a similar case mix [21]. Before any comparison of performance between PICUs of different size is conducted, it is important that the risk-adjusted mortality is accurately predicted.

The first objective of this study was to test the performance of the PIM2 score in French-speaking (FS) PICUs and the relative performance of the PIM2 score when recalibrated using data from FS countries and Great Britain (GB). The second objective was to compare the performance of large, medium and small sized units in FS and GB PICUs.

Methods

All 33 neonatal and paediatric intensive care units (NPICUs)/PICUs and two PICUs affiliated to the Groupe Francophone de Réanimation et Urgences Pédiatriques (GFRUP) were invited to participate on a voluntary basis: 17 agreed, and 15 (13 NPICUs/PICUs and two PICUs) provided the information requested (14 in France and one in Belgium). All consecutive children admitted to these PICUs between 21 June 2006 and 31 October 2007 were included. Children with a history of prematurity who had been hospitalized after birth were included. Patients over 18 years and premature newborns (<37 weeks gestation) admitted at birth were excluded.

Although the period of data collection varied between units, for each unit all patients admitted consecutively during the study period were included.

For the same period, data from 31 PICUs in GB in the PICANet database were analysed for comparison. The selection process and data collection methods have been previously published [10]. Information was prospectively collected for each admission (demographic details, PIM2 score, outcome at PICU discharge (death, alive)). Children were excluded from the analysis if they were aged over 18 years, or less than 2 years and born prematurely (<37 weeks gestation).

Legal considerations

Data collected from routine practice were used in this FS observational study. The two databases were declared to and approved by the French and English authorities (see electronic supplementary material, ESM).

Data collection and management

Clinical data were prospectively recorded on a standardised case-report form. Previously trained physicians (one per centre) entered data into a Web-based database respecting confidentiality requirements (Epiconcept™, Paris, France) (see ESM).

Statistical analysis

Design of analyses

Outcome was vital status at PICU discharge. Validation of the PIM2 score was evaluated with the original published coefficients (FS-PIM2 and GB-PIM2 scores respectively). In each data set, a random sample of half of the population was selected to provide a development data set. Re-estimation of coefficients (recalibration) was conducted in this development set: individual component variables of the PIM2 score (independent variables) were entered into a logistic regression using vital status at PICU discharge as the dependent variable. These two steps were repeated 5,000 times (bootstrap procedure). Bootstrapping is a method that resamples from the original data to create replicate data sets, from which the variability of the quantities of interest can be assessed. The aim is to create alternative versions of data that “we might have seen” [22]. From the 5,000 sets of coefficients obtained, the final model was defined by the mean coefficient of each independent variable. Final coefficients of the recalibrated model were applied to the data set of each country, giving two models named the ‘GB-Rec-PIM2’ score and the ‘FS-Rec-PIM2’ score (Fig. 1, ESM). A cross-recalibration was performed as follows: coefficients of the recalibrated FS model were applied to the GB data set (model named ‘GB-RecFS coeff-PIM2’ score) and coefficients of the recalibrated GB model were applied to the FS data set (model named ‘FS-RecGB coeff-PIM2’ score) (Fig. 1, ESM).
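The half-sample recalibration loop can be sketched as follows. This is a minimal illustration and not the authors' Stata code: it assumes a single predictor instead of the full set of PIM2 component variables, draws each half-sample without replacement, and uses hypothetical names (`fit_logistic`, `bootstrap_recalibrate`, `simulate`) and simulated data.

```python
import math
import random


def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme linear predictors.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, n_iter=25):
    """Fit logit(P(death)) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w             # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break                # degenerate sample: keep the current estimate
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def bootstrap_recalibrate(xs, ys, n_rep=5000, seed=0):
    """Average the re-estimated coefficients over repeated random half-samples."""
    rng = random.Random(seed)
    n = len(xs)
    sum_b0 = sum_b1 = 0.0
    for _ in range(n_rep):
        idx = rng.sample(range(n), n // 2)   # development set: random half
        b0, b1 = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])
        sum_b0 += b0
        sum_b1 += b1
    return sum_b0 / n_rep, sum_b1 / n_rep


def simulate(n, b0, b1, seed=1):
    """Hypothetical data: x is one predictor, y is death (1) or survival (0)."""
    rng = random.Random(seed)
    xs = [rng.gauss(-2.0, 1.5) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(b0 + b1 * x) else 0 for x in xs]
    return xs, ys
```

Calling `bootstrap_recalibrate(*simulate(2000, -0.5, 0.8), n_rep=20)` returns mean coefficients that should lie near the simulated values; extending the sketch to several predictors would require a general linear solver in place of the 2 × 2 inverse.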

This analysis was stratified by PICU level of activity. To define small, medium and large units, cut-offs on the number of admissions per month (admissions per year) were chosen from the FS data set so that each group contained an equal number of PICUs; the resulting cut-offs were fewer than 20 (240), 20–35 (240–420) and more than 35 (420), close to the two cut-offs mentioned in the study by Wolfler et al. [14].
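As a small illustration of these cut-offs, the classification could be written as below. The function name and the handling of the boundary values (exactly 20 and exactly 35 admissions per month counted as medium) are assumptions based on the wording of the cut-offs, not details confirmed by the study.

```python
def picu_size(admissions_per_month):
    """Classify a unit by monthly admissions: <20 small, 20-35 medium, >35 large."""
    if admissions_per_month < 20:
        return "small"
    if admissions_per_month <= 35:
        return "medium"
    return "large"


def per_year(admissions_per_month):
    """Convert a monthly rate to the annual figure quoted in parentheses."""
    return admissions_per_month * 12
```

For example, `per_year(20)` and `per_year(35)` reproduce the annual cut-offs of 240 and 420 admissions.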

Calibration, discrimination and performance

Calibration: In order to compare observed with expected mortality and to estimate the calibration of the PIM2 score, a Hosmer–Lemeshow goodness-of-fit test was performed using the logit of the PIM2 score following logistic regression [23] (see ESM).
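A minimal sketch of the Hosmer–Lemeshow statistic is given below. It assumes ten equal-sized risk groups (the number of groups is not stated here), takes the predicted probabilities directly, and computes the p value with a closed form that is valid only for even degrees of freedom. The default of degrees of freedom equal to the number of groups follows the choice reported in the limitations; `df=8` gives the groups-minus-2 variant. Function names are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form implemented for positive even df only")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i         # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(probs, outcomes, groups=10, df=None):
    """Return (chi-square, p value) comparing observed and expected deaths."""
    pairs = sorted(zip(probs, outcomes))     # order patients by predicted risk
    n = len(pairs)
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        mean_p = expected / n_g
        # Assumes 0 < mean_p < 1 in every group (no empty or degenerate group).
        chi2 += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    if df is None:
        df = groups              # df = number of groups, as for external validation
    return chi2, chi2_sf_even_df(chi2, df)
```

A large p value indicates no evidence of lack of fit; with the same chi-square value, using 8 rather than 10 degrees of freedom always yields a smaller p value, which is why borderline results can change between the two conventions.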

Discrimination: Area under the receiver operating characteristic curve (AUC) and standard error were calculated to estimate the discrimination of the scores (see ESM).
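For illustration, the AUC can be computed as the proportion of (non-survivor, survivor) pairs in which the non-survivor has the higher predicted risk. The sketch below pairs this with the Hanley–McNeil approximation for the standard error; the choice of that approximation is an assumption, since the method actually used is described only in the ESM, and the function name is hypothetical.

```python
import math


def auc_with_se(scores, died):
    """Return (AUC, standard error); ties between a death and a survivor count 1/2."""
    pos = [s for s, d in zip(scores, died) if d == 1]   # non-survivors
    neg = [s for s, d in zip(scores, died) if d == 0]   # survivors
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    n1, n0 = len(pos), len(neg)
    auc = wins / (n1 * n0)
    # Hanley-McNeil approximation to the variance of the AUC.
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n1 - 1) * (q1 - auc * auc)
           + (n0 - 1) * (q2 - auc * auc)) / (n1 * n0)
    return auc, math.sqrt(max(var, 0.0))
```

An approximate 95 % CI is then the AUC plus or minus 1.96 standard errors. The pairwise loop is quadratic in the worst case, which remains manageable for data sets of a few thousand patients.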

Funnel plots: Funnel plots of PICU mortality ratios by number of admissions, along with the corresponding 99.8 % control limits, were produced. Funnel plots are a variation on control charts that identify PICUs with unexpectedly high (above the upper control limit) or low (below the lower control limit) mortality (see ESM).
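As a rough sketch, control limits around a mortality ratio of 1 can be obtained from a normal approximation to the binomial distribution of the number of deaths. This approximation, and the names `funnel_limits` and `flag_unit`, are assumptions made here for illustration; the exact construction used for the published plots is given in the ESM.

```python
import math
from statistics import NormalDist


def funnel_limits(n_admissions, expected_rate, coverage=0.998):
    """Lower and upper control limits for observed/expected mortality around 1."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)      # about 3.09 for 99.8 %
    # Var(O/E) = n*p*(1-p) / (n*p)^2 = (1-p) / (n*p) under a binomial model.
    half_width = z * math.sqrt((1.0 - expected_rate) / (n_admissions * expected_rate))
    return max(0.0, 1.0 - half_width), 1.0 + half_width


def flag_unit(observed_deaths, expected_deaths, n_admissions):
    """Classify a unit as 'low', 'high' or 'in control' relative to the limits."""
    ratio = observed_deaths / expected_deaths
    lower, upper = funnel_limits(n_admissions, expected_deaths / n_admissions)
    if ratio < lower:
        return "low"
    if ratio > upper:
        return "high"
    return "in control"
```

Under this approximation a unit with no deaths falls below the lower limit only when its number of admissions is large enough for that limit to be above zero.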

Standardized W score: The standardized W score (Ws) gives the excess or deficit of survivors per 100 patients compared with the prediction [24, 25]. The standard error (SE) of Ws was also calculated to provide the 95 % confidence interval (CI) [24] (see ESM).
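The idea behind the statistic can be illustrated with the unstandardized W, that is, 100 times the difference between observed and expected survivors divided by the number of patients. The sketch below is an assumption-laden simplification: it omits the case-mix standardization step (which presumably reweights risk strata to a reference population, as described in the cited methods) and uses a model-based standard error; the function name is hypothetical.

```python
import math


def w_score(pred_death_probs, died):
    """Return (W, SE, 95 % CI): excess survivors per 100 patients vs. prediction."""
    n = len(died)
    observed_survivors = sum(1 - d for d in died)
    expected_survivors = sum(1.0 - p for p in pred_death_probs)
    w = 100.0 * (observed_survivors - expected_survivors) / n
    # Under the model, Var(observed survivors) = sum of p * (1 - p).
    se = 100.0 * math.sqrt(sum(p * (1.0 - p) for p in pred_death_probs)) / n
    return w, se, (w - 1.96 * se, w + 1.96 * se)
```

A positive W indicates more survivors than predicted and a negative W fewer; a confidence interval that includes zero indicates neither an excess nor a deficit.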

Logistic regression: A logistic regression was used to study the effect of unit size, with size entered as a categorical or continuous variable (see ESM). Statistical analyses were performed using Stata v10 statistical software (StataCorp LP 2004, Texas, USA).

Results

Among the 15 FS PICUs, none was exclusively devoted to cardiac surgery. Fourteen PICUs were devoted to medical, trauma and post-operative care (including cardiac surgery). One PICU admitted only medical patients. Among 5,651 patients, we excluded two patients older than 18 and 47 patients with incomplete data. Thus, 5,602 patients (414 died, mortality rate 7.39 %) were included. Primary category of illness on admission was congenital disease (33 %), infection (24 %), trauma (8 %), chemical injury (2 %), drug (0.1 %), cancer (3 %), diabetes (1 %), allergic immunologic disease (2 %) and other/undetermined (21 %). For GB PICUs, data are available on the PICANet website [26]. In GB PICUs 20,693 patients (1,014 died, mortality rate 4.90 %) were included. Characteristics and comparisons between the two populations are given in Table 1.

Table 1 Characteristics of the population

In FS PICUs, the FS-PIM2 score had a good discriminatory power (AUC 0.85; 95 % CI 0.83–0.87) and a moderate calibration (p = 0.07). The FS-Rec-PIM2 score had a good calibration (p = 0.33) and the GB-Rec-PIM2 had a moderate calibration (p = 0.06) (Table 2). The GB-RecFS coeff-PIM2 score displayed lack of fit and therefore poor calibration (p = 0.02), whereas the FS-RecGB coeff-PIM2 score displayed good calibration (p = 0.36) (Table 2).

Table 2 Discrimination (AUC) and calibration (chi square, p value) from the FS and GB data sets

Calibration plots showed that the PIM2 score tended to overestimate risk in patients at low risk of mortality, by less than 1 % in GB and 4 % in FS, using both the original and recalibrated equations (Fig. 2a, b for GB and Fig. 2c, d for FS PICUs, in ESM).

PICU size

Using the original PIM2 coefficients, calibration was good in small and medium units (p = 0.25 and p = 0.78 in FS and p = 0.99 and p = 0.25 in GB respectively) and poor in large units (p = 0.03 in FS and p = 0.001 in GB). In both groups, recalibrated and cross-recalibrated PIM2 scores provided good discriminatory power and calibration in small, medium and large units (Table 2).

Funnel plots of crude mortality ratios by number of admissions for each unit indicated that three FS and seven GB PICUs had unadjusted mortality ratios that were lower or higher than expected (Fig. 1). Funnel plots using the original PIM2 score indicated that no FS PICU and only two GB PICUs had adjusted mortality ratios that were lower than expected (Fig. 1). Funnel plots using the recalibrated PIM2 score indicated that no FS PICU and only one GB PICU had an adjusted mortality ratio that was lower than expected (Fig. 1). In the GB data set, one PICU had an observed mortality ratio equal to zero (no patient died in this PICU during the study period) and so this PICU fell below the lower control limit in all three analyses (Fig. 1).

Fig. 1

Funnel plots of GB and FS PICUs using crude mortality ratio, original PIM2 score mortality ratio and recalibrated PIM2 score mortality ratio. The mortality ratio is plotted on the y-axis against the number of admissions to the PICU on the x-axis. To satisfy the condition that, if the distribution of the mortality ratios is random, there is an ~5 % chance of a unit falling outside the control limits, the upper and lower control limits must represent not 95 % but 99.7 % confidence intervals around a mortality ratio of 1 by number of admissions [41]. *Crude mortality ratio: observed PICU mortality/whole population mortality. **Original PIM2 mortality ratio: observed PICU mortality/original PIM2 expected mortality (respectively FS PIM2 score and GB PIM2 score). ***Recalibrated PIM2 mortality ratio: observed PICU mortality/recalibrated PIM2 expected mortality (respectively FS-Rec-PIM2 score and GB-Rec-PIM2 score). PIM paediatric index of mortality, GB Great Britain, FS French speaking

Using the original PIM2 score, the standardized Ws scores demonstrated fewer survivors per 100 cases in small and large FS PICUs, whereas the standardized Ws scores demonstrated an excess of survivors per 100 cases in medium and large GB PICUs (Table 3). With the recalibrated PIM2 scores, the standardized Ws scores did not demonstrate a lack or excess of survivors in small, medium and large PICUs in either group (Table 3).

Table 3 Standardized Ws scores and odds ratios with original and recalibrated PIM2 scores in FS and GB PICUs

Odds ratios for PICU size (small PICUs as reference) were not significant in either group after adjustment for the original and recalibrated PIM2 scores respectively (Table 3).

In the GB data, neither linear nor quadratic models showed any significant effect. In contrast, in the FS data a linear model showed no significant effect, but a quadratic model did (overall chi-square = 8.6, 2 df, p = 0.014) (Fig. 3a, b, ESM). The parameters indicated a minimum risk-adjusted mortality at about 35 admissions/month in the FS PICUs (Fig. 3a, ESM).
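The location of such a minimum follows directly from the fitted quadratic. As a sketch, assume the size model has the form below, where v is the number of admissions per month; the symbols and the way the PIM2 score enters the model are notation introduced here for illustration, and the exact coding used in the study is described only in the ESM.

```latex
\operatorname{logit} P(\text{death})
  = \beta_0 + \beta_1 \operatorname{logit}(\mathrm{PIM2}) + \gamma_1 v + \gamma_2 v^{2},
\qquad
\frac{\partial}{\partial v}\bigl(\gamma_1 v + \gamma_2 v^{2}\bigr) = 0
\;\Longrightarrow\;
v^{*} = -\frac{\gamma_1}{2\gamma_2} \quad (\gamma_2 > 0).

% Overall test of \gamma_1 = \gamma_2 = 0 on 2 degrees of freedom:
P\bigl(\chi^{2}_{2} > x\bigr) = e^{-x/2},
\qquad
x = 8.6 \;\Longrightarrow\; e^{-4.3} \approx 0.0136 .
```

The second line agrees with the reported p value of 0.014 after rounding, and a minimum at about 35 admissions/month corresponds to a fitted ratio of roughly 35 for the expression giving v*.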

Discussion

First, this study has shown that the PIM2 score was valid in the FS population and that the recalibration based on GB data could be applied to FS PICUs. Second, for the recalibrated PIM2 score, the volume of patients showed no effect in GB but did in the FS PICUs, with a minimum risk-adjusted mortality at about 35 admissions per month.

The PIM2 score was chosen because it is free and very simple to collect. Moreover, several studies have shown that the PIM2 score performed well in different countries with [3, 10, 27] or without [14, 28] recalibration. In our study, and considering only the FS data set, the performance of the PIM2 score was good with and without recalibration.

Because of diversity in case mix, structure, organization, staffing and management between different countries [8, 29], region-specific or country-specific equations have been proposed in order to compare ICUs on a similar level [30]. Usually, discrimination and calibration of prognostic scoring systems are good in the population used for the development and for the internal validation. However, in almost all cases, when these scoring systems are applied to a new population, calibration deteriorates although discrimination hardly changes [13]. Recalibration had a large impact on the performance of the models, improving their calibration in particular [31]. We have used the same recalibrated PIM2 score to make the comparison between these two different countries. Conversely, “there is a ‘risk’ to risk adjustment” because different models will not always agree on the identity of outlier performing institutions and on ranking the institutions [31]. The PIM2 score was initially developed from a study in Australia, New Zealand and the UK, without recalibration between these three countries [9]. Furthermore, cross comparisons between different countries with recalibrated severity scoring systems had never been evaluated. In our study, the PIM2 score recalibrated in the GB data set could be applied to the FS data set, but not vice versa. Hypotheses to explain this result might be the larger population of the GB data set and the inclusion of all GB PICUs, which cover a wider range of individuals or risks. The recalibration based on the GB data set and applied to the FS and GB data sets has shown that case mixes from two countries can be considered as case mixes from two regions (states) in the same country. Indeed, differences in case mix between PICUs in the same country have been previously reported. Outcome description of 20 PICUs in the USA showed a significant variation among centres in mortality rate (from 1.4 to 21.3 %) or mean PRISM III score (from 2.42 to 11.18) [32]. Moreover, the annual report of the Australian and New Zealand Paediatric Intensive Care Registry in 2009 described mortality rates from 1 to 9 % and mechanical ventilation percentages from 7.5 to 100 % (three centres with no ventilated patients excluded) [33].

Definitions of unit size vary between paediatric studies: small and large units have been defined by the population median as six or fewer beds and more than six beds [17, 19], and institutional volume has been classified as low (mean 26.8 patients/month), medium (mean 45.6 patients/month) and high (mean 72.6 patients/month) [20]. In another study, cut-offs were not defined but there was a slight increase in mortality rates among PICUs with very high annual admission volumes (about 1,500 admissions per year) [18]. In the study by Wolfler et al., all PICUs were paediatric (mean number of children per year per unit 181) and most PICUs were small (4–6 beds) and had no more than 200–400 admissions per year [14]. Using a comparable definition of size, we did not observe any differences between centres in discrimination and calibration. In our study, the funnel plots for the recalibrated PIM2 indicated that the adjusted mortality rate for all units in France and the UK was consistent. Such results were also observed with the PIM, PIM2, PRISM and PRISM III scores in 26 PICUs in the UK [10]. The relation between SMR and FS PICU volume suggested a reversed J-shaped relationship. Such a relationship was previously observed in data from North American PICUs, with a lowest theoretical threshold of 1,250 annual admissions, more than that observed in our study (420 admissions per year) [18]. A recent study in Australian and New Zealand PICUs used a modified plot of risk-adjusted mortality ratio versus unit mean length of stay: two units were designated as inefficient and one unit was considered to be effective at the expense of high resource use [34].

Strengths and limitations

The two databases were of different size. It is possible that this difference in sample size explains why the GB recalibration could be applied to the FS data set but not vice versa. The GB sample was larger than the FS sample, and the probability that the GB sample included patients similar to those of the FS sample is higher than in the opposite scenario. Thus, the result of the recalibration based on GB data applied to FS PICUs could be expected to be better. Nevertheless, the FS data set included more than 5,000 patients and, thus, the power of the statistical analysis seems acceptable. The case mix was different between these two countries. Infant mortality rate is lower in France than in the UK (3.6/1,000 versus 4.7/1,000) [35], but in the present study, crude mortality was higher in the FS database. Such differences have previously been observed: crude mortality in PICUs was lower in Australia (4.1 %) than in the UK (8.2 %) [9], whereas the infant mortality rate (4.4/1,000 versus 4.7/1,000 respectively) is about the same in these two countries [35]. In the study by Brady et al., optimized models were assessed by random allocation of PICUs (stratified by annual admission number) into development and validation samples in a 2:1 ratio [10]. In our study, a randomization by patient and a bootstrap procedure were applied to half of each data set to decrease the potential inflation of the performance of the new predictive equation when applied to the entire data set. Bootstrap methods can be used to assess the strength of evidence that an identified variable is an important predictor [36]. For the Hosmer–Lemeshow goodness-of-fit test, we used degrees of freedom equal to the number of groups [37]. The use of degrees of freedom equal to the number of groups minus 2 gave the same results (data not shown) except for the original FS-PIM2 and GB-Rec-PIM2 scores, which were not calibrated (p = 0.03 and p = 0.02 respectively). The relationship between volume and outcome is influenced by several variables, such as length of stay, nursing workload and number of premature infants cared for in NPICUs, which were not taken into account in our analysis [34, 38]. Finally, the FS data set included only PICUs that volunteered, whereas Glance [39] has proposed that all ICUs participate in a national ICU-outcome database, as was the case for the GB data set. Universal participation would eliminate the possibility of selection bias, in contrast to voluntary participation, which may result in a non-representative group of ICUs [39].

Conclusion

This study has demonstrated that PIM2 was valid in the FS PICUs. Also, the same recalibration could be applied to two different countries. Calibration of scoring systems should be reassessed periodically to ensure their continued validity. Nowadays, severity scoring systems have to be applied to monitor outcome and, thus, improve the quality of paediatric intensive care networks [40]. If a European paediatric intensive care registry is established, the results of this study support the possibility that a single, appropriately calibrated, risk adjustment model could be used for quality improvement and research within Europe and for comparative analyses between European countries.