Introduction

Risk adjustment systems, such as the Paediatric Risk of Mortality (PRISM) score and Paediatric Index of Mortality (PIM), are widely used in paediatric intensive care. These systems allow objective assessment of severity of illness in heterogeneous patient groups and convert this severity into a numerical mortality risk. They are used for various purposes, including comparison of severity of illness between treatment arms in clinical trials and benchmarking, i.e. comparison of quality of care between different paediatric intensive care units (PICUs) using standardised mortality rates (mortality rates that have been adjusted for severity of illness). Both the PRISM and PIM scoring systems have been developed and validated in tertiary PICUs [1, 2, 3]. In some centres that were closely involved in developing these scoring systems, preliminary data showed that the degree of interobserver reliability was acceptable [4, 5, 6]; however, these centres had a small number of dedicated, thoroughly trained professionals who were responsible for the scoring of patients. This form of organisation is likely to result in low interobserver variability; in contrast, the practical situation in numerous ICUs and PICUs throughout Europe is that severity scoring is performed by a varying number of residents, fellows, (paediatric) intensivists, paediatricians or nurses, with varying degrees of PICU experience and varying degrees of experience and training in the use of PRISM and PIM scores [7].

We previously demonstrated that significant degrees of interobserver variability in the use of PRISM and PIM scoring exist in everyday clinical practice, in physicians with different levels of training and experience [8]. Based on this observation we implemented a training program to improve the use of these risk adjustment scores.

This paper reports the results of this training program in improving accuracy of scoring and decreasing interobserver variability. All physicians who had participated in our first study received this training and were subsequently asked to participate in the present study.

Methods

Physicians from six academic PICUs (tertiary referral centres) with residency and fellowship training programs were asked to participate in our study. Physicians were divided into three categories: residents (n=9) with limited experience in paediatric intensive care (average: 3 months; range: 6 weeks to 6 months); PICU fellows (n=6; experience range: 6–30 months); and paediatric intensivists (n=12) with at least 3 years of full-time PICU experience. The charts of 20 patients who had been admitted to a single PICU in the course of a 1-year period were selected for scoring and randomly divided into two sets. The first set of ten charts was used before the training program, and the second set of ten charts was used thereafter. The charts were selected to reflect typical PICU patients and were not chosen for difficulty of scoring. Relevant data from the medical charts and copies of blank data collection sheets for the PRISM and PIM scores were provided to all participating physicians. Subsequently, these physicians assessed the scores and filled out the data collection sheets. From the PRISM and PIM scores calculated by the individual physicians, the mean (SD) and range of scores were determined for each individual patient, both for the overall group of physicians and for each of the three physician categories, according to methods described previously [9].
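The per-patient summary described above (mean, SD and range, overall and per physician category) can be sketched as follows. This is a minimal illustration, not the procedure of reference [9]: the function names, the category labels and all numerical values are hypothetical, and the sample SD is assumed.

```python
from statistics import mean, stdev

# Hypothetical mortality risks (%) assigned to ONE patient by each physician,
# grouped by physician category. Illustrative values only, not study data.
risks_by_category = {
    "resident": [4.0, 6.0, 5.0],
    "fellow": [5.0, 7.0],
    "intensivist": [6.0, 6.0, 8.0],
}


def summarise(values):
    """Return mean, sample SD and range (min, max) of a list of scores."""
    return {
        "mean": mean(values),
        # The sample SD needs at least two observations.
        "sd": stdev(values) if len(values) > 1 else 0.0,
        "range": (min(values), max(values)),
    }


def summarise_patient(by_category):
    """Summarise one patient's scores per category and for all physicians."""
    summary = {cat: summarise(vals) for cat, vals in by_category.items()}
    pooled = [v for vals in by_category.values() for v in vals]
    summary["all"] = summarise(pooled)
    return summary


patient_summary = summarise_patient(risks_by_category)
for group, stats in patient_summary.items():
    print(group, stats)
```

Repeating this for each of the ten charts in a set gives one row per patient, which is the layout suggested by the captions of Tables 1, 3 and 4.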

We observed significant interobserver variability in both PRISM and PIM scoring before implementation of our training program [8]. Based on these findings and on the specific problems in scoring interpretation that were identified, we implemented a training program and organised training sessions for all participating physicians. In these training sessions the guidelines of both scoring systems were extensively presented and discussed, and various pitfalls encountered in the first part of our study were discussed in detail. In addition, after the training sessions, the physicians received a summary of the guidelines for reference, as well as a summary of the subject matter of the training sessions.

Subsequently, the physicians were asked to assess the PRISM and PIM scores of the second series of 10 patients (group 2). Mean, standard deviation (SD), and range of PRISM and PIM scores for each patient were calculated for the whole group of physicians and according to level of experience, and were compared with those before the training. Intraclass correlation was used to compare reliability before and after implementation of training and strict guidelines, both for all physicians and per level of experience.

Statistical assessment was performed using Student’s t-test for unpaired variables for paired groups and by analysis of variance (ANOVA). Variability between observers was assessed by determining intraclass correlation coefficients.
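As an illustration of how an intraclass correlation coefficient quantifies agreement between observers, the sketch below computes a one-way random-effects ICC from a patients-by-raters table. The choice of this particular ICC form is an assumption (the text does not state which variant was calculated in SPSS), and the function name and all ratings are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for single ratings.

    ratings: list of rows, one row per patient, each row holding the
    scores given to that patient by k raters (same k for every row).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 patients and the same >= 2 raters per patient")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-patient and within-patient mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("all ratings are identical; ICC is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical mortality risks (%) for three patients scored by three raters.
before_training = [[2, 10, 5], [8, 3, 12], [20, 9, 15]]
after_training = [[4, 5, 5], [8, 9, 8], [15, 16, 15]]

icc_before = icc_oneway(before_training)
icc_after = icc_oneway(after_training)
print(round(icc_before, 2), round(icc_after, 2))
```

The coefficient approaches 1 when most of the variance lies between patients rather than between raters scoring the same patient, which is why closer agreement after training shows up as a higher value.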

Statistical significance was accepted for p<0.05. Excel (Microsoft, Redmond, Wash.) and SPSS 9.0 (SPSS, Chicago, Ill.) software was used for all calculations.

Results

The results for the whole group of physicians before and after training are shown in Tables 1 and 2. The PRISM- and PIM-based mortality risks in individual patients before and after training and implementation of guidelines are shown in Table 1. The intraclass correlations for PRISM and PIM scoring before and after training are shown in Table 2. The PRISM- and PIM-based mortality risks divided by category of physicians are shown in Table 3 (PRISM scores) and Table 4 (PIM scores).

Table 1 Interobserver agreement of Paediatric Risk of Mortality (PRISM)- and Paediatric Index of Mortality (PIM)-score-based mortality risk (%) before and after implementation of guidelines and training program for all physicians (n=27)
Table 2 Intraclass correlation for PRISM- and PIM-score-based mortality risk before and after administration of guidelines and training
Table 3 Interobserver agreement of PRISM-score-based mortality risk (%) after guidelines and training divided according to different levels of experience
Table 4 Interobserver agreement of PIM-score-based mortality risk (%) after guidelines and training divided according to different levels of experience

As can be seen in the crude data in Table 1 and the calculations in Table 2, we observed substantial interobserver variability in both PRISM and PIM scoring before implementation of our training program. For PRISM scores average intraclass correlation was 0.51 (range 0.32–0.78); for PIM scores the average intraclass correlation was only 0.18 (range 0.08–0.46). This variability occurred in both experienced and inexperienced physicians [7].

Interobserver agreement for both PRISM- and PIM-score-based risk assessment improved significantly after implementation of guidelines and training. The intraclass correlation after training varied from 0.74 to 0.86 for the PRISM scores, and 0.88 to 0.95 for the PIM score. The increase in intraclass correlation following training was statistically significant (p<0.01).

When the data were subdivided according to level of PICU experience (intensivists, fellows and residents), we found that before training the intraclass correlations for PRISM scoring were significantly lower for residents than for intensivists (p<0.01; Table 2). No such differences were observed for the PIM score (indeed, residents appeared to perform slightly, though not significantly, better). Compared with the measurements before training, there was a substantial decrease in interobserver variability in all three categories of physicians, as indicated by the significant difference in intraclass correlations for the whole group of physicians (p<0.01) and per level of experience (p<0.05 for intensivists, fellows and residents, respectively).

The results for individual patients divided by category of physicians are shown in Table 3 (PRISM scores) and Table 4 (PIM scores). Following training, the differences in performance between the three groups decreased, with intraclass correlation ≥0.74 observed in all groups of physicians.

Discussion

The results of our study demonstrate that training and implementation of strict guidelines are required for reliable assessment of the PRISM and PIM scores. Those physicians using these risk adjustment systems for PICU quality assessment and benchmarking should take this into account.

Our assessment of variability before training revealed a surprisingly high level of variability, with an average intraclass correlation of 0.51 (range 0.32–0.78) for the PRISM score and an average intraclass correlation of only 0.18 (range 0.08–0.46) for the PIM score. These figures were well below our expectations; however, they are likely to reflect the reality in numerous European PICUs where regular training in use of these risk adjustment systems has not been rigorously implemented. Moreover, physicians with varying degrees of experience from different medical centres participated in our study, which increases the likelihood that our results reflect the actual situation.

Our second measurement showed much improved results. The average intraclass correlation after training and guidelines was 0.80 (range 0.65–0.93) for the PRISM score and 0.89 (range 0.80–0.97) for the PIM score. The changes tended to be more prominent in less experienced physicians but were also observed in paediatric intensivists with at least 3 years of PICU experience. Both series of patients were randomly selected to represent the spectrum of patients admitted to the PICU, so it is unlikely that a more complicated group, which might yield a lower intraclass correlation, was selected for the first measurement. Indeed, the average scores were actually somewhat higher in the patients selected after training, which is likely to increase the likelihood of error. This implies that the effects of training are likely to have been somewhat underestimated in our study.

However, even after training and implementation of guidelines, a significant degree of variability in scoring persisted, even in experienced intensivists with comparable training, experience and background; therefore, it seems likely that some degree of variability is inherent in PRISM and PIM scoring, at least in current clinical practice in PICUs in the Netherlands. There are no important differences in the way in which these systems are used between PICUs in the Netherlands and most other European PICUs; therefore, our results are likely to reflect the situation in PICUs with similar forms of organisation throughout Europe.

An interesting observation is the difference in variability and intraclass correlations between PRISM and PIM scores. Before training, intraclass correlation was lower for PIM than for PRISM, whereas after training PIM had a slightly higher intraclass correlation. The reasons for these differences are unclear. In theory, the observation that intraclass correlation was initially lower for the PIM score may be explained by the fact that the PRISM scores were the first to be implemented in everyday clinical practice, and therefore have been used for longer periods of time. Their earlier introduction and the period during which PRISM scores were the only risk adjustment system available for the paediatric population may have made PRISM scores somewhat more familiar to paediatric clinicians, even though PIM scores have also been used for several years. Our observation that especially experienced intensivists had comparatively high intraclass correlations for PRISM scoring, with far lower intraclass correlations for PIM scoring, lends some credence to this hypothesis. An additional factor could be the lower number of variables in the PIM score, which could have led to a greater proportional effect of individual errors. This could also help explain the greater improvement in PIM scoring associated with training: if the lower number of variables in PIM scoring led to greater proportional disagreements before training, any reductions in these errors after training would also lead to greater proportional improvements in intraclass correlation. However, these potential explanations remain speculative, as direct comparisons between PIM and PRISM scoring were not made, and reasons for potential differences were not determined in our study.

Previous studies comparing the reliability of PIM and PRISM scores have reported that both are adequate indicators of probability of mortality for heterogeneous paediatric patient groups [10, 11, 12], with the PIM score performing perhaps marginally better in paediatric cardiac surgery patients [12]. In recent years PICU mortality has decreased, and overall outcome in paediatric critical care has improved significantly. Long-term outcome has also improved, with good functional recovery and quality of life for surviving patients [13]. This has led to a relative overestimation of mortality by both PIM and PRISM risk adjustment systems. The PIM score has recently been revised to take improvements in outcome into account [14].

A potential limitation of our findings is that there was no attempt to determine overall “accuracy” by comparison with a gold standard, i.e. if all observers would make the same mistake, overall agreement would be good, whereas accuracy would be poor. This again might have led to underestimation of the problems with severity scoring; however, to address this issue would require a separate study in which scores are compared with a gold standard, which could consist of a panel of experts (who would have to score all patients according to pre-defined criteria and agree on all issues).

Another potential limitation is that different patients had to be used for the two measurements of variability, to prevent physicians “remembering” issues about individual patients which would have influenced the results. In theory, one of the groups of patients could have been more “difficult” to score, leading to greater variability. Indeed, average scores were slightly higher for the second measurement, indicating the presence of a number of more severely ill patients. In theory, this could imply that variability after training may have been somewhat overestimated in our study; however, the fact that significant variability occurred also in patients with lower scores during the second measurement, and that variability as a percentage of the score in each patient was fairly constant, makes it highly unlikely that this would have significantly affected our overall results and conclusions.

Reliability of risk adjustment may be improved, and variability decreased, if severity of illness scoring is performed by a restricted number of dedicated individuals who are well trained and regularly audited; however, the efficacy of this strategy needs to be determined in future studies, and does not reflect the current overall situation in European PICUs. Our present study shows that substantial improvements in reliability may well be obtained using a rigorous but relatively uncomplicated training program and guidelines. Continued reliability may well require regular updates and audits.

The observations in this multi-centre study are in accordance with previous observations by our group and others [9, 15, 16, 17, 18] on everyday use of the APACHE II scoring system, which is the most widely used risk adjustment system in adult ICUs. The use of this scoring system is associated with interobserver variability of up to 30% in everyday clinical practice [16, 17]. This decreases to around 15% after implementation of guidelines and training [9].

Previous authors have suggested that an intraclass correlation coefficient (ICC) above 0.80 should be considered acceptable in a clinical setting [19, 20]. In our study neither the PIM nor the PRISM score reached this value before training. After training both scores reached intraclass correlations ≥0.80 (albeit only just in the case of the PRISM score). Nevertheless, a degree of variability remained even after training, a fact that physicians using these scores should be aware of even if the degree of variability is deemed acceptable. Some authors have suggested that risk adjustment systems could be used to predict outcome in individual patients [21], although their use for this purpose remains controversial in both the adult and paediatric populations [22, 23]. If attempts are made to predict risk of death in individual patients, issues of reliability of assessment and interobserver variability become even more important.

Another novel application of severity scoring systems is for selecting patients who might gain the greatest benefits from specific treatments. An example of this is the use of APACHE II scores to select patients with severe sepsis for treatment with activated protein C [24]. The use of APACHE II scores in this way is based on observations from the PROWESS trial [23]. This study, which reported a significant decrease in mortality in a large group of patients with severe sepsis treated with activated protein C, observed greater benefits in patients with higher APACHE II scores compared with the overall group [25]; however, the use of risk adjustment systems for such purposes has been challenged on various grounds [26]. Systems such as APACHE II, PRISM and PIM were designed for outcome prediction in large groups of patients, and have never been validated for risk assessment in individual patients. In our opinion, great caution should be taken when making decisions on allocation of resources and treatments in individual patients based on risk adjustment systems. This view is reinforced by observations that organisational changes, case mix of patients and the transfer of patients between units can substantially affect various benchmarking tools to assess ICU performance, including the frequently used standardised mortality ratio [27, 28].
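To make the dependence of benchmarking on scoring concrete, the following sketch computes a standardised mortality ratio as observed deaths divided by expected deaths, where expected deaths are the sum of the predicted individual mortality risks. The function name and all figures are hypothetical and serve only to show how the same unit can appear to perform differently when observers assign different risks to the same patients.

```python
def standardised_mortality_ratio(observed_deaths, predicted_risks):
    """Observed deaths divided by expected deaths.

    predicted_risks: per-patient mortality risks as probabilities (0 to 1),
    e.g. derived from a severity score; their sum is the expected deaths.
    """
    expected_deaths = sum(predicted_risks)
    if expected_deaths <= 0:
        raise ValueError("expected deaths must be positive")
    return observed_deaths / expected_deaths


# Hypothetical unit: five patients, one observed death.
observed = 1

# The same five patients scored by two hypothetical observers.
risks_observer_a = [0.05, 0.10, 0.20, 0.40, 0.25]  # expected deaths: 1.0
risks_observer_b = [0.10, 0.20, 0.40, 0.80, 0.50]  # expected deaths: 2.0

smr_a = standardised_mortality_ratio(observed, risks_observer_a)
smr_b = standardised_mortality_ratio(observed, risks_observer_b)
print(smr_a, smr_b)
```

In this constructed example the second observer's higher risk estimates halve the ratio for an identical set of outcomes, so differences in scoring alone could be mistaken for differences in quality of care.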

Conclusion

In conclusion, although PRISM and PIM scores are valuable tools in paediatric intensive care, it is important to realise that reliability of risk adjustment systems in everyday clinical practice is highly dependent on the implementation of training, guidelines and regular audit of these scoring systems. Even when these precautions are taken, a degree of interobserver and even intraobserver variability is likely to persist. The observations in adult ICUs and our current findings in the paediatric setting underscore the importance of being aware of the limitations of risk adjustment systems, especially when they are used for benchmarking and to assess quality of care in the (paediatric) ICU.