Introduction

Today’s health care environment is focused on providing both high-quality and error-free care. Transparency is becoming an expectation, with many outcomes reported publicly. Comparative databases with case-mix adjusted outcomes are available to many children’s hospitals. High-profile programs such as pediatric cardiovascular surgery and pediatric critical care have access to national and international benchmarks for outcomes. Parallel to the movement focusing on quality, there has been a national effort to reduce errors, especially those falling into the category of “never events”. This movement was sparked by the Institute of Medicine’s (IOM) 1999 report highlighting the need to reduce medical errors, with subsequent recommendations in 2001 to improve quality and promote evidence-based practice [1, 2].

A common adage is “you can’t improve what you can’t measure”. Scoring systems provide an objective measure that can assess quality of care, assist with the evaluation and modification of complex systems of care, improve patient outcomes, and predict morbidity and mortality. Their role in critical care has become well established because physicians’ judgments are too subjective to be used for quality assessment in large samples and because severity of illness assessments are needed in clinical studies. Physician prognostication may be inaccurate for a number of reasons [3–11]. First, there are differences in a physician’s ability to predict outcome based on the stage of the practitioner’s career. Second, there is a tendency to overly weigh recent experience, particularly when experience with a specific condition is limited. Third, physicians may be unable to continuously account for all relevant clinical components that are important in predicting outcome. Finally, the literature supporting evidence-based medicine is ever expanding and may exceed many clinicians’ ability to remain current [12–19].

History

External influences have played a significant role in stimulating an environment that favors the assessment of outcome and the development of scoring systems designed to accomplish case-mix adjustment. In the 1960s, the US federal government focused its attention on social issues, including the healthcare safety net. Medicare and Medicaid programs were developed as fee-for-service plans, with access to healthcare services regardless of the ability to pay. Universal access to care and the concept of patient entitlement changed the perception of healthcare from a privilege to a right. Medical advances in areas such as dialysis and mechanical ventilation led to increasingly “high-tech” care and advanced the emerging specialty of critical care medicine. The result of these changes was increased utilization and cost. As healthcare costs exploded, there was increased focus on appropriate utilization of resources, the quality of these services, and the relationship between cost and quality (i.e. the value equation). These concerns also stimulated the need for objective scoring systems.

Concerns about the quality of medical care have escalated over the last 20 years. In the 1990s, the New England Journal of Medicine published a series of articles highlighting medical errors [20, 21]. Public interest in quality became more visible following the Institute of Medicine report in 1999 [1]. This and subsequent reports ultimately led to the Patient Safety and Quality Improvement Act of 2005, which established a system of patient safety organizations and a national patient safety database.

Intensive Care Units (ICUs) were early leaders in developing methods of quality assessment using accurate and reliable adjustments for case-mix differences. Prognostication in the ICU is essential to the debates about quality and cost. It is essential that the technologically advanced care provided in ICUs results in a meaningful outcome to patients and families. Mortality, morbidity and functional outcome prediction are central to these discussions. Thus, prognostic methods are a logical focus for intensive care physicians.

Early scoring systems such as the Glasgow Coma Scale and the Apgar score were developed to assess outcome in select populations. In the 1970s and 1980s, subsequent scoring systems were developed to appraise global ICU care and outcomes, allowing evaluation of quantity of care, quality of care, and cost. The physiology-based scoring systems enabled case-mix adjustments and comparisons. Importantly, these developments led to the conclusion that there were differences in practice patterns and quality among ICUs [22].

Physiology-based scoring systems were initially built on the concept that there is a direct relationship between mortality and the number of failing organ systems. Organ system failure results from physiologic derangements that were modeled to produce relatively accurate mortality risk estimates. Mortality prediction scores included the Acute Physiology and Chronic Health Evaluation (APACHE) score and the Pediatric Risk of Mortality (PRISM) score. These early scoring systems have been modified and adapted, while others have been developed that evaluate morbidity or apply to select patient populations.

Use of Scoring Systems

Physicians and hospitals use internal and external benchmarking to assess quality of care. Benchmarking establishes an external standard reference to which performance levels can be compared and may be used to define “best practice”. Internal benchmarking allows an organization to compare performance within itself, while external benchmarking compares performance between hospitals or services such as ICUs. Scoring systems may also be incorporated into clinical pathways to remove subjective assessments and incorporate evidence-based medicine. These pathways are developed to improve care quality, decrease variability between individual providers, and improve the efficiency with which care is delivered. The ultimate goal is to deliver high-quality care in the most cost-effective manner.

Clinical trials often include scoring systems. Scoring systems can be used to control for case mix, to balance severity of illness between treatment groups, or to aid in risk stratification of enrolled subjects. They can also be used to compare expected with observed outcomes. Mortality prediction scores such as APACHE and PRISM have commonly been used in this manner.

While it may seem attractive to apply probabilities to direct patient care, this practice is potentially problematic. Risk assessment is less reliable when applied to an individual patient. In particular, the real range (i.e. the 95% confidence interval) of the computed estimate is often much larger than the user appreciates. This is especially relevant when the computed mortality risk is very high but the confidence interval is very wide (i.e. imprecise). Perhaps most importantly, the prognostic performance of physicians is approximately equivalent to that of scoring systems for individual patients [7, 8, 23]. At this time, scoring systems are primarily intended to provide an objective assessment of quality of care, to assess the effects of interventions provided by a healthcare system [19], and to measure severity of illness in clinical trials; they should not be applied to individual patients.
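To make this concrete, the following is a minimal Python sketch using entirely synthetic data and a hypothetical single-predictor logistic model (not any published scoring system); it computes an individual patient's predicted risk and its 95% confidence interval by the delta method on the linear predictor, illustrating how wide that interval can be when the derivation sample is modest.

```python
# Hypothetical example (synthetic data): confidence interval around an
# individual patient's predicted mortality risk from a logistic model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300                                            # modest derivation sample
severity = rng.normal(0, 1, n)                     # a single illustrative severity variable
death = rng.binomial(1, 1 / (1 + np.exp(-(-2.5 + 1.2 * severity))))

X = sm.add_constant(severity)
fit = sm.Logit(death, X).fit(disp=0)

# One hypothetical high-severity patient (severity three standard deviations above the mean).
x_new = np.array([1.0, 3.0])
eta = x_new @ fit.params                           # linear predictor
se = np.sqrt(x_new @ fit.cov_params() @ x_new)     # its standard error (delta method)
expit = lambda z: 1 / (1 + np.exp(-z))
lo, hi = expit(eta - 1.96 * se), expit(eta + 1.96 * se)
print(f"predicted risk {expit(eta):.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```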

Elements of a Scoring System

The important elements of a successful scoring system include outcome, predictor variables, and model. The outcome should be objective, clearly defined and relevant. Historically, mortality has been the primary outcome measure for ICU prognostication, with more recent interest in morbidity and functional outcomes.

Predictor (independent) variables should also be objective, clearly defined, reliably measured, mutually exclusive, applicable across institutions, and as free from lead-time bias as possible. To minimize bias associated with model development, these variables should be defined and collected a priori. Data elements may include diagnoses, physiologic status, physiologic reserve, response to therapy, and intensity of interventions [24]. These elements must be logical for the intended use of the scoring system. For example, a recent effort to use the Pediatric Index of Mortality (PIM2) for cardiovascular ICU patients led to poor predictive performance [25]. This was not surprising given the paucity of acute physiologic variables and cardiac diagnoses included in the score.

Model development is the next element in scoring system design. Typically, individual predictor variables are tested with a univariate analysis for statistical association with the outcome. The variables that are “loosely associated” with outcome (e.g. p < 0.30) in the univariate analysis are then combined in a multivariate analysis. The type of multivariate analysis is outcome specific. Logistic regression is used for dichotomous outcomes such as survival/death. Linear regression is used for continuous variables such as length of stay. Multivariate linear or quadratic discriminant function analysis is most often used for categorical outcomes such as diagnoses [26, 27]. For each independent variable included in the model, a general guideline suggests that there should be at least ten outcome events (e.g. deaths) in the analysis.
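As an illustration of this workflow, the sketch below uses synthetic data and hypothetical predictor names (heart rate, systolic blood pressure, lactate); it is not the derivation of any published score, but it walks through the univariate screen at p < 0.30 followed by a multivariable logistic regression for a dichotomous outcome and checks the events-per-predictor rule of thumb.

```python
# Illustrative sketch (synthetic data): univariate screening at p < 0.30
# followed by multivariable logistic regression for a dichotomous outcome.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "heart_rate": rng.normal(120, 25, n),     # hypothetical physiologic predictors
    "systolic_bp": rng.normal(90, 20, n),
    "lactate": rng.gamma(2.0, 1.5, n),
})
# Synthetic outcome: death probability rises with lactate and falls with blood pressure.
logit = -4 + 0.6 * df["lactate"] - 0.02 * (df["systolic_bp"] - 90)
df["death"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Step 1: univariate screen -- keep predictors "loosely associated" with outcome (p < 0.30).
candidates = []
for var in ["heart_rate", "systolic_bp", "lactate"]:
    uni = sm.Logit(df["death"], sm.add_constant(df[var])).fit(disp=0)
    if uni.pvalues[var] < 0.30:
        candidates.append(var)

# Step 2: multivariable logistic regression on the retained predictors.
model = sm.Logit(df["death"], sm.add_constant(df[candidates])).fit(disp=0)
print(model.summary())

# Rule of thumb from the text: roughly ten outcome events per retained predictor.
print("events per predictor:", df["death"].sum() / max(len(candidates), 1))
```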

Reliability

Reliability of the data elements and the model is vital to a successful scoring system. Clearly defined data elements, precise timing of data collection, and standardized training for data collectors all contribute to high-quality data acquisition. Reliability of the score can be measured within (intra-rater) or between (inter-rater) observers [24, 27]. The kappa (κ) statistic can be used to measure the level of agreement, with 0 representing agreement no better than chance and 1 representing perfect agreement. The type of data determines the most appropriate reliability measurement: dichotomous data are assessed with the κ statistic, ordinal data with the weighted κ statistic, and interval data with the intraclass correlation coefficient [27].
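For illustration, the short sketch below uses hypothetical ratings and scikit-learn's cohen_kappa_score to compute the unweighted κ for dichotomous data and a weighted κ for ordinal data, as described above.

```python
# Illustrative sketch: inter-rater agreement with Cohen's kappa (hypothetical ratings).
from sklearn.metrics import cohen_kappa_score

# Two raters scoring the same 10 patients on a dichotomous item (e.g., organ failure yes/no).
rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print("kappa:", cohen_kappa_score(rater_a, rater_b))   # 0 = chance, 1 = perfect agreement

# For ordinal data (e.g., a 1-5 severity grade), a weighted kappa penalizes
# large disagreements more heavily than near-misses.
ordinal_a = [1, 2, 3, 4, 5, 3, 2, 4, 1, 5]
ordinal_b = [1, 2, 4, 4, 5, 3, 3, 4, 2, 5]
print("weighted kappa:", cohen_kappa_score(ordinal_a, ordinal_b, weights="linear"))
```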

Validity

Validation of a scoring system is the final test to determine whether the score measures what it was designed to measure. A scoring system is often first validated internally. Internal validation can be accomplished by using subsets of the population from which the score was derived. Three common techniques for internal validation are data-splitting, cross-validation, and bootstrapping [27, 28]. Data-splitting involves randomly dividing the sample into a training set and a validation set, with the training set used for initial model development. Cross-validation generates multiple training and validation sets through repeated data-splitting. Bootstrapping tests the model’s performance on a large number of samples drawn randomly, with replacement, from the original population. If a score demonstrates good internal validity, it can then be externally validated. External validation requires application of the scoring system to a patient population separate from the initial study.
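The sketch below illustrates these three internal-validation strategies on synthetic data with a generic logistic model; the dataset, model, and use of the AUC as the performance measure are illustrative choices rather than the methods of any particular scoring system.

```python
# Illustrative sketch (synthetic data): data-splitting, cross-validation, and bootstrapping
# applied to a generic logistic "mortality" model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9], random_state=0)

# 1. Data-splitting: derive the model on a training set, validate on the held-out set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("split AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# 2. Cross-validation: repeated data-splitting over k folds.
print("5-fold AUC:", cross_val_score(LogisticRegression(max_iter=1000), X, y,
                                     cv=5, scoring="roc_auc").mean())

# 3. Bootstrapping: refit on many samples drawn with replacement, test against the original data.
aucs = []
for seed in range(200):
    Xb, yb = resample(X, y, random_state=seed)
    m = LogisticRegression(max_iter=1000).fit(Xb, yb)
    aucs.append(roc_auc_score(y, m.predict_proba(X)[:, 1]))
print("bootstrap AUC (mean):", np.mean(aucs))
```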

Discrimination and calibration are two common statistical methods used to assess model performance [24]. Discrimination is the ability of a model to distinguish between outcome groups and is assessed by the area under the receiver operating characteristic curve. An area under the curve (AUC) of 0.5 represents chance performance, and an AUC of 1.0 represents perfect discrimination. Calibration measures the agreement between predicted and observed outcomes over the entire range of risk prediction. The most accepted method for assessing calibration is the goodness-of-fit statistic proposed by Lemeshow and Hosmer [29].
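A brief sketch with synthetic predicted risks and outcomes shows how discrimination (AUC) and a Hosmer–Lemeshow-style calibration test can be computed; the grouping into ten deciles of predicted risk follows common convention and is an illustrative implementation, not the exact procedure of reference [29].

```python
# Illustrative sketch: discrimination via AUC and calibration via a
# Hosmer-Lemeshow-style decile test (synthetic predictions and outcomes).
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
p_pred = rng.uniform(0.01, 0.6, 2000)            # hypothetical predicted mortality risks
y_obs = rng.binomial(1, p_pred)                  # outcomes generated from those risks

# Discrimination: AUC of 0.5 = chance, 1.0 = perfect separation of survivors and deaths.
print("AUC:", roc_auc_score(y_obs, p_pred))

# Calibration: split into deciles of predicted risk and compare observed vs expected deaths.
deciles = np.quantile(p_pred, np.linspace(0, 1, 11))
groups = np.digitize(p_pred, deciles[1:-1])
hl = 0.0
for g in range(10):
    mask = groups == g
    n_g, obs, exp_p = mask.sum(), y_obs[mask].sum(), p_pred[mask].mean()
    hl += (obs - n_g * exp_p) ** 2 / (n_g * exp_p * (1 - exp_p))
# With 10 groups, the statistic is compared to a chi-square distribution with 8 df.
print("Hosmer-Lemeshow chi2:", hl, "p:", 1 - stats.chi2.cdf(hl, df=8))
```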

Types of Scoring Systems

Scoring systems allow objective quantification of complex clinical states. They can be categorized by the type of predictor variables used to predict outcome. Examples include intervention specific, physiology specific, disease or condition specific and functional outcome scoring systems.

Intervention Specific Scoring Systems

Conceptually, more therapies are provided to sicker patients, so the number of interventions a patient receives during hospitalization can serve as a proxy for severity of illness. The Therapeutic Intervention Scoring System (TISS) [30] is an example of this type of scoring system and has been applied to pediatric patients [31]. The initial TISS score included 76 different therapeutic and monitoring interventions scored on a scale of 1–4 based on complexity and invasiveness. The number of interventions increases with severity of illness, thereby increasing the TISS score, which in turn predicts the risk of mortality. However, individual and institutional practice regarding the use of interventions may vary, which will affect the TISS score independent of the patient’s physiology.

Physiology Specific Scoring Systems

Adult Mortality Scores

The most common adult ICU mortality scores are the APACHE, the Mortality Probability Model (MPM), and the Simplified Acute Physiology Score (SAPS). APACHE IV uses the worst physiologic values from the first 24 h of ICU admission. Weighted variables including age, physiologic data, and chronic co-morbid conditions, combined with major disease categories, are used to predict hospital mortality [32]. The MPM III collects information at the time of ICU admission (MPM0). Age, physiologic data, and acute and chronic diagnoses are included in the MPM III. The MPM III adds a “zero factor” term for elective surgical patients who have no risk factors identified by the MPM variables, to account for the low mortality risk in this patient population [33]. SAPS 3 is similar to MPM III in that it collects data at the time of ICU admission to predict the probability of hospital mortality. In addition to age and physiologic data, SAPS 3 includes variables relating to patient characteristics before ICU admission and the circumstances of ICU admission [34]. Table 6.1 summarizes the model characteristics of each of these adult mortality scoring systems.

Table 6.1 Adult and pediatric physiology specific mortality score model design

Pediatric Mortality Scores

The two most common pediatric ICU mortality scores are the PRISM and the PIM. Pediatric scoring systems are similar to their adult counterparts in many respects. The most recent version, PRISM III, was developed from over 11,000 patients in 32 centers in the United States [35]. It consists of 17 physiologic variables collected within the first 12 h (PRISM III-12) or first 24 h (PRISM III-24) of ICU admission (Table 6.1). The most abnormal values within these windows are recorded and used to predict the risk of mortality. PRISM has been externally validated [36] and was the first pediatric scoring system to be protected by site licenses.

The PIM2 was developed from over 20,000 patients in 14 centers in Australia, New Zealand and Great Britain (Table 6.1) [37]. Ten variables are collected from the time of first contact between the patient and the ICU team, regardless of location, up to 1 h after ICU admission. Conceptually, data is collected early in treatment so that it reflects the patient’s physiologic state rather than the quality of ICU treatment rendered. While lead-time bias theoretically applies to all scoring systems, there is no evidence that data collected during the first 12 or 24 h of ICU admission adversely affects the validity of a model.

Pediatric Morbidity Scores

Morbidity occurs more commonly than mortality and offers an alternative method for assessment of quality of care. The Pediatric Multiple Organ Dysfunction Score (PEMOD) and Pediatric Logistic Organ Dysfunction (PELOD) were modified from previously developed adult scores [38–40] to describe complications and morbidity in pediatric ICU patients [41]. These scoring systems quantify organ system dysfunction based on objective criteria of severity. However, these scores were developed (PEMOD and PELOD) and validated (PELOD only) on relatively small patient populations [42].

Disease or Condition Specific Scoring Systems

Trauma

The most widely used and validated trauma scoring system is the Revised Trauma Score (RTS) [43]. Components of the RTS include the Glasgow Coma Scale, systolic blood pressure, and respiratory rate. The RTS was initially designed for use by prehospital care personnel to identify adult trauma patients who would benefit from care at a designated trauma center. Subsequently, it has been used to predict mortality from blunt and penetrating injuries and has been validated for use in pediatric trauma patients [44–46].

The Abbreviated Injury Scale (AIS) was originally developed to quantify injuries sustained in motor vehicle accidents [47]. The Injury Severity Score (ISS) was adapted from the AIS [48]. The ISS categorizes injuries into six body regions and rates injury severity on a scale of 1–5 (1 = minor, 5 = critical/survival uncertain). The Trauma Score and Injury Severity Score (TRISS) method combines the RTS and ISS to create a mortality prediction score [49]. TRISS is commonly used for trauma benchmarking and can be applied to the pediatric population [44, 45, 50].
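The ISS arithmetic is not spelled out above; as a point of reference, the minimal sketch below implements the standard calculation, in which the squares of the three highest regional severity ratings are summed. The function name and region labels are illustrative.

```python
# Minimal sketch of the standard ISS calculation: the worst severity rating in each
# body region is taken, and the squares of the three highest regional values are summed.
def injury_severity_score(region_severity: dict[str, int]) -> int:
    """region_severity maps each of the six ISS body regions to its worst severity rating."""
    top_three = sorted(region_severity.values(), reverse=True)[:3]
    return sum(score ** 2 for score in top_three)

# Hypothetical patient: severe head injury, serious chest injury, moderate extremity injury.
print(injury_severity_score({
    "head_neck": 4, "face": 0, "chest": 3,
    "abdomen": 0, "extremities": 2, "external": 1,
}))  # 4**2 + 3**2 + 2**2 = 29
```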

Congenital Heart Disease

Congenital heart disease (CHD) patients are a classic example of a population that requires coordination of multiple services and care systems to achieve a good outcome. Scoring systems for congenital heart disease are based on the diagnosis rather than patient-specific physiologic variables. Three scoring systems have been developed to assess the outcome of these patients. The earliest were the Risk Adjustment for Congenital Heart Surgery (RACHS-1) and the Aristotle Basic Complexity (ABC) scores, both developed by consensus of experts. In contrast, the most recently developed score, the Society of Thoracic Surgeons and European Association for Cardiothoracic Surgery (STS-EACTS) congenital heart surgery mortality score, was empirically derived by retrospective analysis of outcomes from the procedure list defined by the ABC score.

RACHS-1 was the first pediatric congenital heart surgery scoring system developed [51]. It was developed by an 11-member panel of pediatric cardiologists and cardiovascular surgeons. Data was initially analyzed from over 9,000 patients in the Pediatric Cardiac Care Consortium [52] and from hospital discharge data. The score was subsequently refined utilizing multi-institutional databases [53]. Similarly, the ABC score utilized the expert opinions of 50 internationally based congenital heart surgeons to evaluate 145 surgical procedures based on potential mortality, morbidity, and technical difficulty [54]. The ABC score was subsequently validated using more than 35,000 cases from the STS-EACTS database [55].

The STS-EACTS score was developed to measure mortality associated with CHD based on analysis of more than 77,000 cases in the STS-EACTS database between 2002 and 2007 [56]. In total, 148 procedures were classified into mortality risk categories from 1 to 5 (1 = low mortality risk, 5 = high mortality risk). The score was externally validated in 27,700 operations and showed a higher degree of discrimination for predicting mortality than either the RACHS-1 or the ABC score [56]. As with previous scores, a major limitation is that the STS-EACTS score does not allow adjustment for patient-specific risk factors.

Other

There are many scoring systems for specific conditions that are potentially relevant to the management of ICU patients. Some examples include scoring systems for croup [57], asthma [58], bronchiolitis [59] and meningococcemia [60, 61]. These scores have been used for triage decision making and severity of illness measurement. However, the ability to validate these models is limited by small patient populations.

Functional Outcome Scores

Functional outcome scores such as the Pediatric Cerebral Performance Category (PCPC) and the Pediatric Overall Performance Category (POPC) are modified from the Glasgow Outcome Scale [62, 63]. These scores are used to assess short-term changes in cognition (PCPC) and physical disabilities (POPC). Their major drawback is the need for the observer to subjectively project a functional status. While the PCPC has been correlated with 1- and 6-month neuropsychological tests such as the Bayley scales and IQ testing, there is a very wide distribution of neuropsychological test scores within each PCPC category. This lack of discrimination would necessitate very large sample sizes if the test were used to assess long-term outcome [64].

The Functional Status Scale (FSS) was recently developed to objectively measure functional outcome across the entire pediatric age range [65]. The FSS was developed through a consensus process of pediatric experts and measures six domains of functioning: mental status, sensory functioning, communication, motor functioning, feeding, and respiratory status, each scored from 1 to 5 (1 = normal, 5 = very severe dysfunction). The Adaptive Behavior Assessment System II (ABAS II) was used to establish construct validity and calibration within each functional domain. The FSS showed very good discrimination across ABAS II categories.

The Future of Scoring Systems

The future of scoring systems is likely to be largely influenced by the changes occurring in the structure and organization of care. Very large databases will become available to health services researchers, enabling more accurate and reliable mortality predictions. More reliable outcome predictions with general models such as PRISM, and with disease or condition specific models, will be used to assess and benchmark the quality of care in individual pediatric ICUs (PICUs). These databases may become large enough to build models whose performance characteristics are sufficient for application to the individual pediatric critical care patient. The interest in transparency may dictate public disclosure of this information.

More reliable morbidity assessment methods such as the FSS will shift the paradigm of critical care outcome measurement from mortality prediction to a more global evaluation of functional outcome and morbidity. This will focus quality assessment of PICU therapies on morbidity as well as mortality which will further stimulate advances in quality research, case-mix adjustment methods, and forecasting outcomes. PICU admission data combined with quality of care data will be used to forecast long-term pediatric disability. When the likelihood of new morbidity as well as death becomes part of the outcome prediction models, they will have much greater applicability and utility to individual patients and individual patient decisions.

Conclusion

Scoring systems in intensive care medicine have increased in number and utility over the past 30 years. They are increasingly applicable to clinicians and health services researchers and in the future may be applicable to individual patients. Clinical scoring systems provide a standardized method for ICU benchmarking and are required by governing health care bodies. Additionally, benchmarking information is used by health care payers in creating managed care contracts. Consumer demands for error-free medicine, quality improvement and transparency may lead to public scoring of hospitals and individual physicians. These forces will continue to increase the need for risk adjusted outcomes and institutional benchmarking. It is in the intensivist’s best interest to understand scoring systems, their applications and implications.