Introduction

The quality of emergency care and outcomes after multiple trauma may vary importantly between different countries and individual hospitals [1]. Survival of the multiply injured patient remains the primary objective of treatment, but mortality continues to be high, especially in the presence of major head trauma or serious hemorrhage [2]. The probability of survival depends on the trauma, the patient, and the quality of care received. There is an increasing interest in risk scoring systems in surgery and external benchmarking for trauma center performance [35]. Trauma scores may serve as instruments of quality control for the systematic comparison of patients and institutions [6]. Even though outcome prediction was found to be insufficient for individual decision making in clinical situations [7], it might serve as a tool for quality assurance in the comparison of therapeutic results. To improve the quality of care, it should be possible to compare the performance data from different institutions based on an internationally accepted standard [8]. Susan Baker’s statement on this issue should be remembered: “If you have never felt the need for any type of severity scoring system, then you probably have never had to explain how it is that the survival rate of 58% in your trauma center is actually better than the survival rate of 97% in some other hospital where the patients are much less seriously injured” [9]. The multifactorial sequence of actions following multiple trauma and the various possible confounders make it obvious that one single parameter cannot be sufficient for the comparison of data, but scores that combine several variables may be helpful. Thirty years ago original investigations reported on trauma severity scoring to predict mortality and evaluate trauma care [10, 11]. Over the years, several modifications of original scores have been developed, and more complex prediction models have been designed to improve comparison of a hospital’s expected outcomes with its actual outcomes [12]. However, the fact remains that there is still no international consensus on which score is the most reliable. The relevant articles are numerous, and reports are often confusing because of conflicting findings or noncomparable data. Any interpretation of raw hospital data has to take possible confounders into account—such as different trauma populations or health care structures—with the consequent need for adjustment in the case mix being studied [1315]. But studies in the field continue to publish raw data—for example, mean mortality rate—without including sufficient information on the composition or management of the cohort investigated [16]. This makes any reasonable comparison of the quality of treatment between different centers impossible. In certain areas, participation in a regional multicenter database program, such as the National Trauma Data Bank [17], the Trauma Audit Research Network database [18], or the German Trauma Registry [19], may offer an opportunity to compare data from the contributing hospitals. Even so, data recruitment for national registries varies widely, a finding that underlines the lack of uniform benchmarking [20]. Furthermore, there are still many regions in the world without any form of regional data collection. Irrespective of the extensive body of literature on the subject, an internationally accepted standard of mortality prediction to permit adequate quality assessment is still missing [5, 8, 21, 22].

Our investigation posed two main questions: How good is the management of multiply injured patients in this institution in terms of actual mortality rate? and Which prediction model will emerge as the most useful for routine clinical use? In the context of a typical single-center scenario, we therefore tested the practicability of several scoring systems as discussed in the literature and composed of routine parameters used worldwide in the emergency treatment of multiple trauma patients. We focused mainly on comparison of the multiple variations reported for the most frequently applied mortality prediction model, the Trauma and Injury Severity Score (TRISS) [11, 23, 24].

Materials and methods

Patients and procedure

From August 2001 to April 2005, 506 consecutive trauma patients arrived in the emergency room (ER) of our university hospital with suspected multiple trauma and were entered into a prospective database. Multiple trauma emergency treatment guidelines were followed according to international standards: ER diagnostics routinely included Focused Assessment with Sonography for Trauma (FAST) [25] and conventional x-ray for thorax and pelvis, followed by multislice CT, if needed (including clarification of the cervical spine). In every case, at least the trauma team leader (trauma surgeon or general surgeon), but more often several members of the interdisciplinary resuscitation team, had completed an Advanced Trauma Life Support (ATLS) course [26]. Multiple trauma patients were defined as trauma victims where (1) at least two Abbreviated Injury Severity (AIS) [27] regions were affected, and (2) the Injury Severity Score (ISS) was >16 [28]. Patients with monotrauma, ISS ≤ 16, and those secondarily admitted from another hospital were excluded from the analysis (n = 269; 30-day mortality rate 8.6%). For every emergency case medical students specifically trained for the study but not involved in the treatment of patients were alerted and called to join the team on duty. They were present during the diagnostic and therapeutic process in the resuscitation bay, and they documented the clinical procedure as well as all available laboratory data in a standardized fashion. Prehospital variables were extracted from the ambulance or helicopter documentation. Patients’ demographic data and the variables needed to determine all scores were collected prospectively on admission, and the first available value (prehospital or emergency room) was used for analysis: age, ISS [27], GCS [29]; systolic blood pressure [SBP] [30]; arterial hemoglobin saturation by pulse oximetry [SapO2] [31]; breath rate [BR] [32]; heart rate [HR] [33]; shock index [SI] [34]; hemoglobin [Hb] [35]; prothrombin time test [Quick’s value, PT] [19]; base excess [BASE] [36]; and lactate [37]. The whole data collection was subsequently reviewed and completed by fellow surgeons. The calculation of scores and models was undertaken according to the cited literature for the AIS and ISS, respectively, based on the follow-up data obtained on hospital discharge [27, 38, 39]: ISS, Revised Trauma score [RTS] [40]; Triage Revised Trauma score [T-RTS] [40], PTS [41]; Bouamra score [42], Simplified Acute Physiology Score [SAPS II] and resulting predicted mortality [43], and the TRISS in the versions of Boyd (Boyd-TRISS) [11], Champion (Champion-TRISS) [23], the Major Trauma Outcome Study (MTOS-TRISS; www.trauma.org/index.php/main/article/387), the Trauma Audit & Research Network (TARN-TRISS) [44], and the National Trauma Data Base (NTDB-TRISS) [24].

Death and time of death were recorded for every patient, and mortality rate was assessed after 24 h, 30 days, and 2 years following trauma. The study was approved by the local ethical committee, and informed consent was obtained from patients or relatives if possible.

Statistics

The results are presented as mean ± standard deviation (SD), if not stated otherwise. All statistical tests were two-tailed. Student’s t-test was used for comparison of means in normally distributed data of continuous variables; analysis of variance (ANOVA) was used for similar criteria in three or more unpaired subsamples. Chi-square analysis was used to test categorical data. Routinely documented variables suspected or known from the literature to be possible factors associated with mortality were analyzed, resulting in the parameters for univariate analysis with the primary endpoint mortality (24 h and 30 day): age, ISS, GCS, SBP, SapO2, BR, HR, SI, Hb, PT, BASE, lactate, RTS, PTS, and single TRISS-versions.

Stepwise regression analysis was performed by including all factors that were found to be significant in univariate analysis. Starting with the major variables age, ISS, and GCS in the first three steps, other significant factors of the univariate analysis were added separately in a fourth step, whereby missing data were replaced by the mean. The stepwise model was executed in order to evaluate any additional information obtained from subsequently added parameters using mortality as the dependent variable and any variable under investigation as an independent variable. Nagelkerke R2 was calculated to estimate variance of the model, and the chi-square statistic was used to calculate the significance of the improvement of the model.

Mortality scores were calculated in accordance with the relevant descriptions in the literature. Most of these scores contain the RTS, which is calculated using GCS, blood pressure, and breath rate as a subscore. As the RTS is reported to have a high rate of missing values in the literature, scores were (1) calculated and compared for the sample without missing values (n = 144) and (2) coded for missing breath rates (90 of 237 cases) according to the nonpathological category of breath rate to avoid overestimation of mortality, and missing blood pressures were entered as the mean of the fully documented sample (3 of 237 cases) [44]. Calculation of TRISS values for the group of patients with missing variables (breath rate) and comparison with those for the rest of the study cohort did not reveal any significant difference between groups in terms of age, ISS, GCS, predicted SAPS II, or observed 30-day mortality. For example, initial breath rate was missing in 93 patients (mean age: 43.4 ± 20.2; mean ISS: 29.3 ± 10.9; mean GCS: 10.9 ± 4.4; predicted SAPS II mortality 76.9%; 30-day survival 73.1%) compared with 143 patients with these data (mean age: 42.5 ± 21.4, P = 0.732; mean ISS: 29.6 ± 11.9, P = 0.812; mean GCS: 10.2 ± 4.7, P = 0.239; predicted SAPS II mortality 75.0%, P = 0.598; 30-day survival 80.9%, P = 0.229).

Scores were tested by generation of receiver operating characteristic curves (ROCs) and by comparison of the areas under the curve (AUC), which are reported with 95% confidence intervals (CI). The precision of the models investigated (AUC) differed only minimally if patients with missing values were excluded from the subsequent analysis or if missing values were substituted in this standardized fashion. Data are only shown for the latter.

The Z-statistic was calculated for definitive outcome-based evaluation (DEF) [45]. In DEF, Flora’s Z-score quantifies the difference between the actual number of deaths (or survivors) in the test subset and the predicted number of deaths (or survivors). The formula for calculating Z is: Z = D  (q i /p i Q i ), where D is the actual number of deaths, Q i is the predicted probability of death for a patient i, q i the predicted number of deaths and p i the predicted P s for patient I [46]. A Z-score with an absolute value of >1.96 is statistically significant [47]. As our study population (due to the inclusion criteria presenting with a higher percentage of severely injured patients; M < 0.88) differed importantly from the Major Trauma Outcome Study cohort, further statistical analysis or interpretation was not undertaken [47, 48].

To calculate sensitivity and specificity statistics, mortality scores were dichotomized around the value of 0.5. Data were analyzed with SPSS version 13.01 (SPSS Inc., Chicago, IL), and a P value <0.05 was considered significant.

Results

The prospective cohort consisted of 237 consecutive multiply injured patients (mean age: 42.8 years; 73.4% male; 3.4% penetrating trauma) with a mean ISS of 29.5 (Table 1). Trauma cases were the consequence of traffic accidents (65.0%), falls from a height (18.6%), or other causes (16.4.9%). The average length of stay in the intensive care unit (ICU) was 6.2 days and at our hospital 13.0 days. The 24-h mortality rate was 10.1% (n = 24); 30-day mortality, 22.8% (n = 54); overall 2-year mortality, 24.1% (n = 57).

Table 1 Descriptive values and predicted survival rate for patients in series

Univariate analysis on 30-day mortality found significantly higher values—that is, worse values—for non-survivors for age, GCS, SpO2, Hb, PT, lactate, hospital stay, ISS, T-RTS, Champion-TRISS, and NTDB-TRISS, as well as SAPS II, in comparison to survivors (Table 2). None of the variables under investigation showed an obvious cut-off value beneath which survival could be ruled out definitively.

Table 2 Univariate analysis for 30-day survivors versus non-survivors

The prediction model for 30-day-mortality including the most significant variables age, ISS and GCS showed an overall variance of 46.7% (Table 3). Further stepwise logistic regression found that both the initial hemoglobin and the prothrombin values still added some significant information to the model (2.1%, P = 0.02 resp. 2.4%, P = 0.32). When both variables (Pearson correlation 0.50) were added to the model combining age, ISS, and GCS, no significant improvement could be demonstrated. With regard to 24-h mortality, age, ISS, and GCS together explained 48.3% of deaths. In addition, SapO2 contributed significant information, with an overall variance of 53.4% (P = 0.006); Hb, with an overall variance of 51.6% (P = 0.026); and both parameters together, with an overall variance of 56.7% (P = 0.002; detailed models not shown).

Table 3 Logistic regression of prognosis for 30-day-mortality (Model 3 is used for the calculation of the BMTPM)

On the basis of these findings we drew up the “Basel Multiple Trauma Prediction of Mortality” (BMTPM) for the prognosis of 30-day mortality in our investigation cohort, including age, ISS, and the GCS: BMTPM = 1/(1 + (1/(EXP(5.686 − 0.044 * age − 0.116 * ISS + 0.129 * GCS)))). The validation of this score in a similar cohort of 172 multiple trauma patients (in-hospital mortality 22%) previously treated at our institution [49] showed that BMTPM predicted a survival rate of 78.8% with an AUC of 0.805. We tested this score by comparing its discriminative value with that of the two variables showing significance in multivariate analysis as well as the mortality prediction scores under evaluation (Table 4). Figure 1 illustrates the range of the ROC-curves. After we added the hemoglobin and prothrombin variables to the model, the AUC improved from 0.86 to 0.87.

Table 4 Comparison of variables/models predicting 30-day survival (descriptive values and area under the curve, AUC)
Fig. 1
figure 1

Comparison of selected models in the prediction of 30-day mortality (ROC curves). ROC Receiver Operating Characteristic curves, TRISS Trauma and Injury Severity Score, in the versions of Boyd (Boyd-TRISS) or the National Trauma Data Base (NTDB) as NTDB-TRISS, and the Basel Multiple Trauma Prediction of Mortality (BMTPM)

Depending on the score used, predicted mortality ranged from 13% (NTDB-TRISS) to 31% (Bouamra score). Table 4 lists the corresponding Z-scores, which differed significantly above and below the actual observed mortality (BMTPM = 0), i.e., between −5.7 (NTDB-TRISS) and 3.7 (Bouamra score). The actual mortality in our cohort (22.8%) was equally predicted by several TRISS-versions (e.g., Champion-TRISS: mean 0.78, i.e., 22% predicted mortality; MTOS-TRISS: mean 0.77, i.e., 23% predicted mortality) and SAPS II. Of all the models tested, the NTDB-TRISS emerged as the score that combined high precision (AUC 0.84) with the highest benchmark level in the prediction of 30-day mortality (mean 0.87, i.e., 13%). The sensitivity and specificity of all scores under evaluation was lower if elderly or more seriously injured patients were compared to younger or less seriously injured ones (an example of the Champion- and the NTDB-TRISS is given in Table 5).

Table 5 Sensitivity and specificity of Champion-TRISS and NTDP-TRISS in different subsamples (according to Demetriades [47, 59])

Discussion

We report three major findings from a search for mortality benchmarking in a prospective cohort of multiply injured patients:

The actual 30-day mortality rate of 22.8% exactly matched the predictions of the TRISS-versions of Boyd, Champion, and the Major Trauma Outcome Study.

The TRISS using newer National Trauma Data Bank coefficients (TRISS-NTDB) was found to combine the highest requirement profile with precision in the prediction of 30-day mortality.

Our investigation showed that comparison of outcomes with reported values is still almost impossible for a single center not integrated into a trauma registry.

Can we as clinicians be satisfied with our 30-day mortality of 22.8%—a rate that precisely matched the mortality predicted by the early TRISS versions? In 1983 Champion stated that injury severity scales of proven reliability and validity are essential to the accurate prediction of outcome [10]. Quantitative assessment of treatment quality for different multiple trauma cohorts requires well-defined and accepted benchmarking data, e.g., to predict the 30-day mortality rate [5, 8, 50]. Reliable comparison between centers is impossible based on raw or grouped outcome data alone [17, 35]. Without published standardized data, an apparent discrepancy of outcomes may be misleading. Following adjustment for differences in the case mix of individual hospitals, the National Study on the Costs and Outcome of Trauma (NSCOT) showed that the risk of death is significantly lower at a level I trauma center than at a non-trauma center [14]. But, a recent investigation revealed that, in terms of survival, half the American College of Surgeons (ACS)-verified level I trauma centers performed significantly differently from their risk-adjusted expectations [13].

Most workers in the field agree that medical care of multiply injured patients has improved qualitatively in recent decades. The trauma registry of the German Society of Trauma Surgery showed a reduction in mortality from 1993 to 2005 from 22.8% to 18.7% [51]. In contrast, other European hospitals have reported unchanged or even increased mortality rates over the last 10–15 years, although ISS and trauma patterns have remained comparable [52]. Not only the quality of treatment may change over time: In recent years centers in Austria and the Netherlands have reported a decreasing mean ISS, an increased mean patient age, and a higher percentage of severe head injuries among the multiply injured [16, 53]. Given these discrepant and not uniformly adjusted data on mortality [18], individual centers can hardly find reliable data with which to compare their own performance and evaluate their treatment of multiple trauma patients. Participation in national multicenter databases may offer regional solutions, with the potential advantage of eliminating geographic variations in patient characteristics, the epidemiology of injury, or the health care provided. Nevertheless, from an international perspective, registers using different inclusion criteria and their own optimized assessment scores only fail to permit international comparison [5456].

Our study population is representative of a typical European multiple trauma cohort in the upper range of trauma severity [9, 16, 57], with a mean age of 43 years, an ISS of 30, and a large majority of blunt injuries following traffic accidents. As for studies based on the German Trauma Registry [51] the focus of our investigation was on the primary treatment of multiple trauma only and excluded patients with mono- or minor trauma as well as those admitted secondarily in order to avoid major confounders. With the objective of appraising the raw 30-day mortality rate in our cohort of multiple trauma patients adequately, we tested several prediction models published in the literature, e.g., different variations of the TRISS. We came to realize that several authors had not stated which version of the TRISS they actually used [44, 46, 58, 59]. The TRISS and the coefficients it includes were originally developed in the 1980 s by Boyd et al. based on the US-American Major Trauma Outcome Study (MTOS) database for motor vehicle accidents [11]. Since then, the need to update or adjust these coefficients has been discussed [60, 61] and various modifications of the original Boyd-TRISS have been published in an effort to predict mortality more precisely [23, 24, 44, 60]. Despite some important criticisms and the subsequent development of several other prediction models [51, 57, 58, 62], the TRISS still figures as the internationally most frequently applied instrument for the assessment and adjustment of injury severity facilitating comparison of survival rates for (multiple) trauma patients [24, 44, 55, 63].

Our actual rate of non-survival (22.8%) was equally well predicted by the Champion [23]- and MTOS [23]-TRISS (AUC 0.83 and 0.84, respectively). From a statistical point of view and given the minimal differences in predicted survival and precision (AUC values) of these models, as well as the observed deviations in comparison to the similar Boyd-TRISS or SAPS II, we cannot advocate any single outperforming score. In 1997 the Cologne Validation Study showed that both the original Boyd and the subsequent Champion versions of the TRISS were highly accurate for predicted mortality, even though trauma patients and care are different in the US and Germany [9]. Other European reports on trauma populations not confined to multiple trauma support this finding for the Boyd-TRISS [64]; others observed a better than expected survival rate, regardless of whether newer MTOS- or TARN coefficients were used [44]. The fact that the TRISS can identify major differences in levels of multiple trauma care is clear from observed mortality rates in developing countries that are two or three times higher than the TRISS mortality prediction [46, 65]. In contrast, an unrealistic >50% rate of high-performance hospitals has been reported regarding TRISS and ASCOT-predicted mortality data using older MOTS- formulae [12]. An updated TRISS with new NTDB-derived coefficients yielded an improved mortality prediction compared to the older MTOS [66]-TRISS [55].

Against the background of this debate, we tested the correlation of routine emergency parameters with mortality, in a first step, and the predictive ability of the resultant combined scores, in a second step. As expected, combined scores outperformed the single emergency parameters in the prediction of mortality. The treatment-restricted variables age, ISS, and GCS were found to be the main independent prognostic parameters in our multivariate analysis. The prominent importance of each of these three variables, all being part of the TRISS, has been described before [6769]. Historically, the inclusion of additional parameters into any prediction model attracted heightened clinical interest if these were associated with a potentially therapeutic option, such as the correction of base excess, or other specific optimization of treatment [21, 54]. However, other studies found that the existence of a cause-and-effect relationship could not be proven even when several risk factors were clearly associated with poorer outcome [2, 68]. In contrast to other investigators [4, 35, 70], we found that, in addition to these three variables, only the initial hemoglobin value added some small prognostic information on both 24-h and 30-day mortality in multivariate analysis. Initial prothrombin time appeared to have a comparable impact on 30-day mortality only. Our observation that this effect was no longer statistically significant if the hemoglobin and prothrombin times were both added to the model provides indirect evidence for the strong association of these two measurements. Of course, this finding makes sense from a clinical point of view and supports published reports on blood loss and traumatic coagulopathy [19, 54, 71]. Unfortunately, the fact that we could not adequately record transfusion volume requirements or pre-trauma anticoagulant status meant that we had to forego interpretation of these two values (i.e., 10.1% for hemoglobin and 16.9% for prothrombin). In addition, the limited statistical power of our cohort in comparison to larger databases may explain why we could not clearly identify any additional variables of influence.

In recent years numerous prediction models have been described in the literature, yielding conflicting data in terms of constitution and precision. A prognostic model including base excess and prothrombin time as significantly independent factors in addition to age, ISS, and GCS reached an AUC of 0.90 in the German Trauma Registry [72, 73]. Lackner et al. confirmed this finding for another center also participating in that registry [70], but with a lower AUC (0.82). Astonishingly, even though we did not include the two variables base excess and prothrombin time in our prognostic BMTPM-model we observed a higher precision for the prediction of mortality (AUC 0.86). A disadvantage of adding various variables to any existing prediction model is that it becomes more complicated and, in effect, more prone to dropout in terms of missing values. The two parameters base excess and prothrombin time are particularly susceptible to a high missing rate in the emergency situation [19, 74], which may then negate the enhanced precision achieved by adding the variables. Furthermore, adding hemoglobin and prothrombin added only marginal precision to the BMTPM-model (+0.01 AUC). The relatively simple BMTPM, with its advantage of avoiding missing breathing values, was found to have the highest prognostic value in comparison to all other variables tested in our study, and it was equal to or even better than standard scores like the TRISS or SAPS II, which include more parameters but are challenged by the problem of missing values. Of course, a high prognostic value will be expected because the BMTPM was developed in this specific cohort. Our finding of a lower AUC when testing the model in a retrospective cohort supports this expectation. The equally precise SAPS II score has the disadvantage that it needs many more subvariables for determination.

A few authors report even higher precision for their prediction models, e.g., an AUC of 0.91 for the SAPS II, 0.95 for the Bouamra score [57] or an AUC of 0.97 for a version of the TRISS that is not clearly specified [58]. In comparison with our investigation, the observed differences might generally be explained by the lack of adjustment for different study cohorts. For example, Demetriades et al. demonstrated a higher misclassification rate when applying TRISS methodology to elderly patients or those suffering more severe trauma [47, 59], and as a result, they concluded that this approach should be discontinued. Even though we also observed reduced sensitivity and specificity in mortality prediction for all the scores under investigation in the relevant patient subgroups, their final conclusion has to be critically reviewed. From a statistical perspective, the discriminatory power of any scoring system to predict mortality will be increased if more extreme outcome cases are included in the analysis. In large general trauma databases with predominantly mild trauma patients, there is always an overwhelming probability (>95%) of survival, in contrast to an almost negligible probability of non-survival. In this light and given the complexity of most prediction scores, it is debatable whether the application of a specificity model to favor selected subgroup analysis is plausible in clinical routine.

Overall, in our collective, the quality of the prediction models under investigation differed only marginally between AUC 0.82 and 0.86. What did differ was the predicted mortality rate of the corresponding scores, which ranged from 13% (NTDB-TRISS) to 31% (Bouamra score). The Bouamra score with its complex methodology includes gender as well as age, ISS, and GCS, but even so, yielding an almost identical AUC in comparison to all other models in our cohort clearly overestimated mortality. For the purpose of quality assurance the equally precise NTDB-TRISS with its lower predicted mortality provided the highest target benchmark. A more simplistic approach would be just to aspire to obtaining a higher survival rate than that predicted by any of the prediction models, but this carries the disadvantage of not knowing the extent of success. As discussed earlier, the evidence that the NTDB-TRISS provides a more adequate assessment of the level of performance aspired to today favors our approach [12]. Consequently, our mortality rate of 9% higher than predicted with the newer NTDB-TRISS supports the conclusion that our treatment outcomes can be improved.

Apart from this challenging clinical implication, our investigation illustrates the need for a well-accepted international benchmarking process in the assessment of mortality for the multiply injured patient. Understandably, clinicians are concerned about whether any such method actually reflects quality of care, especially if trauma centers are to be ranked by mortality rate [13, 75]. In our opinion, such quantitative statistical outcome analysis will only be a first step in the quality assessment of multiple trauma therapy at individual centers. Undoubtedly, detailed qualitative analysis of critical clinical processes as well as case report forums, such as interdisciplinary morbidity and mortality conferences, remain indispensable for achieving any further improvement of treatment [47, 76].

This study has several limitations:

We restricted our investigation primarily to 30 day-mortality. This prospective consecutive series represents a single-center experience with the disadvantage of a limited case number, and it cannot compete with large national databases. Our perspective was different because we wished to assess quality of performance in an institution that is not part of a larger registry by comparing actual mortality to reported values. Because our study cohort included multiply injured patients only, statistical comparison with general trauma registries such as MTOS, NTDB, or TARN was not possible due to the missing match of study populations. Independent of this fact, any further matching procedures, including M-, Z- and W-statistics, are highly sophisticated and hardly realistic for single centers that are not integrated into large national databases [45, 48, 63]. In contrast to other researchers, we did not exclude either major head trauma [30], patients who did not survive the first 24 h [77] or patients for whom some values were missing [55]. Our analysis was restricted to the parameters that are routinely obtained in the treatment of multiply injured patients, and therefore we cannot comment on other variables frequently discussed in this context, such as cytokines or inflammation markers.

Unlike other studies [78, 79], we did not find any impact of gender on outcome. As is typical for a mid-European hospital, we experienced a very low rate of penetrating trauma, which limits comparison with centers reporting on a much higher incidence of violence-related trauma [9]. Similarly, we cannot comment on the potential impact of race on outcome as reported, for example, in studies from the United States. [80]. We did not investigate the effects of variables such as co-morbidity, medication, or transfusion requirements, all of which are under debate as major contributors to outcome but which are difficult to obtain and to score in a reliable manner in the emergency setting of multiple trauma [54, 71, 81]. Because our study focused on the prediction models discussed here, we cannot comment on the effect of other scores or stratification methods [20, 46, 55, 82].

Another well-known problem of any clinical study analyzing numerous parameters is the handling of missing values, especially in the emergency treatment period, e.g., lactate, breath rate or the resultant scores. Some of the reported differences in the accuracy of prediction models may, in reality, be due to variable completeness of data acquisition for the different study cohorts. The trauma registry of the German Society of Trauma Surgeons reports a 39% rate of cases with missing base excess data [74], and a large Canadian trauma database reports a missing rate of almost 40% of GCS data and 5.6% of respiration rate values [40]. The latter accounted for 22% of deaths [32]. The exclusion of patients with missing values from further analysis [55] may not only diminish the number of cases analyzed but also, and even more importantly, may create an undesirable bias, possibly related to the specific composition of any particular cohort [83]. We observed an important number of missing values for base excess, lactate, and respiration rate, with the consequence that it was impossible to calculate the RTS and TRISS for those patients. The subgroup of patients with missing values did not show any significant differences from the rest of the cohort with regard to the variables tested. To compensate for this problem, we decided to uniformly impute neutral values to substitute for missing cases [44]. The imputation method has been proven to yield hospital quality measures that are almost identical to those based on the true data [32, 83, 84]. Other groups tried to solve the problem of missing data by choosing a mortality prediction model in which variables with a higher rate of missing values, such as the RTS, were substituted by another variable with low or no missing data, for example, the GCS [42, 85]. In the light of the finding that Bouamra scoring did not perform as well as other TRISS versions in our study cohort, we cannot advocate this preference.

Other inconsistencies in the way raw data are handled by different investigators in emergency medicine concern the time at which ISS diagnosis is made, i.e., on patient admission or at the end of the hospital stay. We decided in favor of the latter because of the well-known tendency toward early underscoring of trauma in emergency situations. In addition, differences may arise depending on whether field values or emergency department measurements are used for the first evaluation and subsequent scoring of patients [85].

Overall, Clark has reasoned that, because there are so many possible limitations, the comparison of institutional trauma survival to a standard value will continue to be a challenge [86].

In summary, with the aim of assessing our institution’s quality of performance in the treatment of multiply injured patients, we tried to compare it with the mortality rates reported in the literature. Lacking an international standard, we needed to find out which of the investigated prediction models would emerge as the most useful for clinical routine. In our investigation, despite all constraints, the NTDB-TRISS appeared to provide the best score, simultaneously combining high statistical precision with the highest therapeutic aims in terms of patient survival after multiple trauma. For today’s single center without the chance of participating in a large, high-quality trauma registry, we recommend the use of the NTDB-TRISS as an important first step toward establishing a critical quality assessment of the treatment of multiple trauma patients. Our investigation underlines the clinical need for a benchmarking procedure, accepted worldwide, that would be an integral part of standardized quality management for all multiply injured patients.