Introduction

Reliable, objective, and routine assessment of a patient’s condition after surgery may provide important information for improving postoperative management and support efforts to improve quality of care and patient safety. Based on data routinely collected as part of the National Surgical Quality Improvement Program (NSQIP), a simple three-item, 10-point Surgical Apgar Score was developed to predict the occurrence of major postoperative complications and 30-day mortality (Table 1) [8]. The Surgical Apgar Score is intended to supply an immediate, easily calculated, objective summary of a patient’s condition after surgery, both to identify patients at high risk for major complications and to support hand-off communication between teams. Closer monitoring of postoperative patients with low scores (eg, a score of 4 or less) may allow prevention of complications. The score also was intended as a modifiable target for surgical teams and researchers aiming to improve outcomes, serving as a measure for quality improvement programs, and the development of practices that reduce the number of patients with low scores was advocated. However, it has not been tested in patients undergoing joint arthroplasty, and no similar score has been developed for such patients.

Table 1 The 10-point Surgical Apgar Score

For patients undergoing hip and knee arthroplasties, we asked (1) whether the score provides accurate risk stratification for major postoperative complications, and (2) whether it captures intraoperative variables contributing to postoperative risk independent of preoperative risk.

Patients and Methods

Data were extracted from the medical records of all 3511 patients undergoing primary and revision hip and knee arthroplasties from March 1, 2003, to August 31, 2006, at Massachusetts General Hospital. Data for primary diagnosis, procedure, comorbidities, intraoperative variables, and immediate outcomes during hospitalization were collected from electronic clinical data and electronic administrative records at Massachusetts General Hospital. The study protocol, including a waiver of informed consent for individual patients, was approved by the Massachusetts General Hospital Human Research Committee.

Data for each patient were abstracted manually from discharge summaries, operative notes, and ICD-9 codes by one investigator (THW). Primary diagnoses and procedures were abstracted from the operative reports and discharge summaries. In case of conflicting information, the operative report took precedence. To assess interrater reliability, a clinical fellow (THV) was first instructed on the abstraction techniques and then abstracted data for the same variables regarding comorbidities and major complications from a random sample of 50 patients taken from the database. Abstraction of outcomes by an independent rater yielded 100% agreement for all the variables.

Primary diagnoses and indications were collapsed into the following categories: osteoarthritis (including dysplasia and slipped capital femoral epiphysis); rheumatoid arthritis (including inflammatory etiologies, such as villonodular and psoriatic and heterotopic ossification); infection-related joint arthroplasty; mechanical (including dislocation, aseptic loosening, failed allografts, pseudarthrosis, wear, and osteolysis); avascular necrosis; posttraumatic changes; and benign or malignant tumor-related joint arthroplasty.

Intraoperative parameters included estimated blood loss (EBL), lowest heart rate (HR), and lowest mean arterial pressure (MAP). Intraoperative records were stored in an electronic Anesthesia Information Management System (Saturn; Dräger Medical, Telford, PA, USA). This database is accessible via Structured Query Language (SQL). A SQL query was developed to examine the intraoperative physiologic data during the surgery. Electronic anesthesia data differ from handwritten records in multiple aspects [5, 21]. Specifically, the tendency for inclusion of some artifactual or erroneous values (for example, false pressure readings when an arterial catheter is flushed) is of concern. Therefore, we used a previously validated filtering algorithm to eliminate artifactual readings [19, 21].

The 10-point Surgical Apgar Score was computed for each patient based on three intraoperative parameters (EBL, HR, MAP) (Table 1). Derivation of the score was based on a logistic regression model as previously reported [8].
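The point assignment can be sketched in a few lines of Python. This is a minimal illustration only: Table 1 is not reproduced in this text, so the cut points below are assumed from the commonly published version of the score and should be checked against Table 1 before use. The function name and argument names are hypothetical, and the published rule that assigns 0 heart rate points for pathologic bradyarrhythmia or asystole is not handled.

```python
def surgical_apgar_score(ebl_ml, lowest_map_mmhg, lowest_hr_bpm):
    """Return a 0-10 score from three intraoperative values.

    Cut points are ASSUMED from the commonly published score;
    verify them against Table 1 of the source before any use.
    """
    # Estimated blood loss (mL): less blood loss earns more points (0-3).
    if ebl_ml > 1000:
        ebl_points = 0
    elif ebl_ml > 600:
        ebl_points = 1
    elif ebl_ml > 100:
        ebl_points = 2
    else:
        ebl_points = 3

    # Lowest mean arterial pressure (mm Hg): higher earns more points (0-3).
    if lowest_map_mmhg < 40:
        map_points = 0
    elif lowest_map_mmhg < 55:
        map_points = 1
    elif lowest_map_mmhg < 70:
        map_points = 2
    else:
        map_points = 3

    # Lowest heart rate (beats/min): lower earns more points (0-4).
    if lowest_hr_bpm > 85:
        hr_points = 0
    elif lowest_hr_bpm > 75:
        hr_points = 1
    elif lowest_hr_bpm > 65:
        hr_points = 2
    elif lowest_hr_bpm > 55:
        hr_points = 3
    else:
        hr_points = 4

    return ebl_points + map_points + hr_points
```

Under these assumed cut points, a patient with 300 mL EBL, a lowest MAP of 60 mm Hg, and a lowest HR of 70 beats/min would receive 2 + 2 + 2 = 6 points.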

The primary end point was the incidence of a major postoperative complication or death during hospitalization. Major complications were identified from diagnoses in discharge summaries, operative reports, and ICD-9 codes and are based on definitions from the NSQIP [13] (Table 2).

Table 2 Individual major complications

Basic demographics and summary statistics were calculated overall and for those with and without major complications. There were no missing data for analysis of the intraoperative variables and the outcomes. For development of the preoperative risk model, only cases with complete data were used; as some laboratory values were missing for some patients, the overall number of patients for analysis decreased from 3511 to 3236 for this model. For all variables, including the three intraoperative variables of interest, differences between patients with and without complications were compared using the two-sided t test or Wilcoxon rank-sum test for continuous variables and chi square test for categorical variables.

To assess predictive performance, we used univariable logistic regression, with the 10-point score as a categorical predictor, to evaluate the calibration and discrimination of the score as a comprehensive predictive instrument for major complications or death in the arthroplasty cohort. We evaluated calibration with calibration graphs and the Hosmer-Lemeshow goodness-of-fit test [12]. This test is commonly used to assess goodness of fit for logistic regression models. It tests whether the observed event rates match expected event rates. If observed and expected event rates are similar, the model is called well calibrated. If the p value of this test is more than 0.05, there is no evidence of lack of fit. Discrimination, as a measure of how well the score can differentiate patients with and without complications, was assessed by the c-statistic. Closely related to sensitivity and specificity, the c-statistic represents the percentage of all possible pairs of patients with discordant outcomes (one with and one without a major complication) in which the model correctly assigns a higher probability of having a major complication to the patient with the complication than to the patient without the complication.
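The pairwise definition of the c-statistic can be illustrated directly. The sketch below is not the SAS procedure used in the study; it is a plain Python illustration with a hypothetical function name, and it assumes the common convention that tied predictions count as half a concordant pair.

```python
def c_statistic(predicted, outcomes):
    """Fraction of (complication, no-complication) pairs ranked correctly.

    predicted: predicted probabilities, one per patient.
    outcomes:  1 if the patient had a major complication, else 0.
    Ties in predicted probability count as half (an assumed convention).
    """
    events = [p for p, y in zip(predicted, outcomes) if y == 1]
    nonevents = [p for p, y in zip(predicted, outcomes) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one patient in each outcome group")

    concordant = 0.0
    for p_event in events:
        for p_nonevent in nonevents:
            if p_event > p_nonevent:
                concordant += 1.0   # model ranks this pair correctly
            elif p_event == p_nonevent:
                concordant += 0.5   # tie: counted as half
    return concordant / (len(events) * len(nonevents))
```

A value of 0.5 corresponds to ranking no better than chance, and 1.0 to perfect separation of patients with and without complications.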

To investigate the stability of the score in relation to mortality status, sensitivity analysis was performed by subgroup analysis, with patients who had died being excluded.

To analyze whether the score contributes incremental information regarding patient risk independent of baseline clinical variables, we derived a preoperative risk model based on clinical variables, easily obtainable before surgery, using stepwise multivariable logistic regression to predict major complications. Variables tested for inclusion in the model included age, gender, weight, height, American Society of Anesthesiologists (ASA) class, creatinine, estimated glomerular filtration rate, blood urea nitrogen (BUN), and diagnosis of femoral fracture. Preexisting conditions including diabetes mellitus and pulmonary and cardiovascular comorbidities also were evaluated as risk factors (Table 2). Pulmonary comorbidity was defined as preexisting chronic obstructive pulmonary disease, ventilator dependence, or pneumonia. Cardiovascular comorbidity was defined as prior myocardial infarction, angina, congestive heart failure, or coronary revascularization. Patients with a history of transient ischemic attack (TIA), or stroke, with or without residual neurologic deficit were pooled in one group, called “history of stroke/TIA” and included in cardiovascular comorbidity. Nonphysiologic outliers and influential points were checked individually for the logistic regression models, and variables were truncated accordingly when necessary. Model diagnostics were performed with residual plots. The c-statistic for this model was calculated to summarize discrimination and the Hosmer-Lemeshow goodness-of-fit test was used to assess calibration.

Patients were stratified into equal-sized quintiles based on their preoperative risk as determined by the derived model. A Spearman correlation was used to compare preoperative risk stratification and the score. Outcome rates for increasing levels of the Surgical Apgar Score (categorized as 0–4, 5–6, 7–8, and 9–10) were calculated for each quintile of preoperative risk. A Cochran-Armitage chi square test of trend was performed separately for each preoperative risk quintile to see whether, in each level of preoperative risk, outcome rate was related to the intraoperative risk as measured by the score category. We also used logistic regression to assess whether the score added incremental prognostic information independent of preoperative risk, again using the incidence of major complications as outcome and controlling for preoperative risk predictions. A likelihood ratio test was computed to compare the model with preoperative risk predictions as the only independent variables with a model that included the preoperative risk predictions and the score. The results are presented as the odds ratio, 95% confidence interval, and corresponding p value for the score, adjusted for preoperative risk. All analyses were performed using the SAS® 9.1 statistical software package (SAS Institute Inc, Cary, NC, USA).
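The trend test applied within each risk quintile can be sketched as follows. This is an illustrative Python version of the standard Cochran-Armitage statistic, not the SAS code used in the study. The function name is hypothetical, equally spaced scores (0, 1, 2, 3) for the four score categories are an assumption, and any counts passed to it here are made up for illustration.

```python
import math

def cochran_armitage_trend(events, totals, scores=None):
    """Two-sided Cochran-Armitage test for trend in proportions.

    events: number of patients with a complication in each ordered category.
    totals: number of patients in each ordered category.
    scores: numeric category scores; defaults to 0, 1, 2, ... (assumed).
    Returns (z, p_value).
    """
    if scores is None:
        scores = list(range(len(totals)))
    n_total = sum(totals)
    p_bar = sum(events) / n_total  # overall complication rate

    # Score-weighted sum of observed minus expected events per category.
    t_stat = sum(s * (r - n * p_bar)
                 for s, r, n in zip(scores, events, totals))

    # Variance of the statistic under the null hypothesis of no trend.
    sum_ns = sum(n * s for s, n in zip(scores, totals))
    sum_ns2 = sum(n * s * s for s, n in zip(scores, totals))
    variance = p_bar * (1.0 - p_bar) * (sum_ns2 - sum_ns ** 2 / n_total)
    if variance <= 0:
        raise ValueError("trend test undefined: no variation in outcome or scores")

    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    return z, p_value
```

With categories ordered from the lowest to the highest score category, a negative z indicates complication rates that fall as the score rises.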

Results

Patients who experienced major complications were older and more likely to be male, have cardiovascular disease, pulmonary disease, or diabetes, and an ASA Class III or greater. The likelihood of complications varied: among procedures, the highest rates occurred for hemiarthroplasty and revision THA; fractures and infections were the highest-risk diagnoses (Table 3).

Table 3 Preoperative risk factors and in-hospital outcomes

The score did not provide comprehensive risk stratification in hip and knee arthroplasties. The score achieved a c-statistic of 0.61. There were no discrepancies between observed and predicted outcome rates (Hosmer-Lemeshow chi square statistic p = 0.20); however, calibration curves (Fig. 1) showed the observed outcome rate in this orthopaedic cohort was consistently lower than that predicted by the score for general and vascular surgery. Excluding mortality from the analysis did not appreciably affect model performance (c-statistic = 0.60).

Fig. 1

A calibration plot compares the model’s predicted probabilities based on the original general and vascular surgery cohort and observed proportions in our arthroplasty cohort. Triangles display deciles of the predicted probabilities in the arthroplasty cohort. The diagonal line reflects the ideal situation (predicted probability = observed proportion). The curve represents the relation nonparametrically. The histogram in the lower part of the figure shows the distribution of predicted probabilities in this sample where the height of each line represents the number of patients shown on the y-axis on the right side of the figure. Probabilities greater than 50% were truncated. This plot shows the score consistently overpredicted major complications (observed rates were lower than predicted rates) and as such is insufficiently accurate to serve as a comprehensive risk stratification tool.

The score captured intraoperative contribution to postoperative risk independent of preoperative risk. Among intraoperative measures, the mean lowest MAP was lower and the mean lowest HR and EBL were higher among patients with major complications compared with those without complications (Table 3). The majority of patients in our cohort had a score ranging from 7 to 10 and few patients had scores between 0 and 4 (Fig. 2). The proportion of major complications or death decreased (p < 0.001) with increasing levels of the score (Fig. 2). Stratification by increasing score levels also shows a decreasing trend of likelihood ratios for major complications in arthroplasty. This is consistent with a similar observed trend in general/vascular surgery (Table 4) [18].
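The stratum-specific likelihood ratios referred to here follow the usual definition: the proportion of patients with a complication who fall in a score category, divided by the proportion of patients without a complication who fall in that category. A minimal Python sketch is given below; the function name is hypothetical, and it assumes every category contains at least one patient without a complication.

```python
def stratum_likelihood_ratios(with_complication, without_complication):
    """Likelihood ratio for each score category.

    with_complication:    counts of patients with a major complication,
                          one entry per score category.
    without_complication: counts of patients without a complication,
                          in the same category order (each must be > 0).
    """
    total_with = sum(with_complication)
    total_without = sum(without_complication)
    ratios = []
    for n_with, n_without in zip(with_complication, without_complication):
        share_with = n_with / total_with            # P(category | complication)
        share_without = n_without / total_without   # P(category | no complication)
        ratios.append(share_with / share_without)
    return ratios
```

A ratio above 1 means the category is overrepresented among patients with complications; a ratio below 1 means it is underrepresented.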

Fig. 2

Major complication and death rates are shown according to the 10-point Surgical Apgar Score from the operation. The graph shows a monotonic increase in risk across decreasing score categories, indicating the score provides prognostically meaningful information regarding the risk of complications. Below the graph, the absolute number of patients distributed across the score categories is shown. The majority of patients have high score values and only very few have low score values.

Table 4 Likelihood ratios in relation to Surgical Apgar Scores

We identified four independent preoperative predictors (ASA class, BUN, cardiovascular comorbidity, diagnosis of fracture), which together provided important prognostic information regarding postoperative risk (c-statistic = 0.73). The score correlated with preoperative risk predictions (r = −0.074; p < 0.001), indicating patients with a higher burden of preoperative risk factors tended to have slightly higher intraoperative risk as well (represented by lower scores), but most of the variation in the scores (almost 99%) was not explained by preoperative risk. In four of the five quintiles of preoperative risk, there was a trend of increasing major complication rates with decreasing score values (Table 5). After controlling for preoperative factors independently associated with increased risk of complications (ASA class, BUN, cardiovascular comorbidity, diagnosis of fracture), the odds ratio for the score was 0.746 (95% confidence interval, 0.66–0.84; p < 0.0001). This suggests independent intraoperative contribution to postoperative risk because, for each 1-point decrease in the score, the odds of a major complication increased by 34.0%. Adding the 10-point score as an independent predictor to the preoperative risk model improved model discrimination (c-statistic increased from 0.73 to 0.75).
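The step from an odds ratio of 0.746 per 1-point increase to a 34.0% increase in odds per 1-point decrease is an inversion of the odds ratio: 1/0.746 ≈ 1.340. The short Python sketch below (hypothetical function name) makes that arithmetic explicit.

```python
def percent_odds_increase_per_point_decrease(odds_ratio_per_point_increase):
    """Convert an odds ratio per 1-point score increase into the
    percentage increase in odds per 1-point score decrease."""
    inverted = 1.0 / odds_ratio_per_point_increase  # odds ratio per 1-point decrease
    return (inverted - 1.0) * 100.0

# Reported adjusted odds ratio for the score: 0.746
print(round(percent_odds_increase_per_point_decrease(0.746), 1))  # 34.0
```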

Table 5 Postoperative outcomes by preoperative risk stratum and Surgical Apgar Score

Discussion

The 10-point Surgical Apgar Score has been developed for providing an estimate of patients’ risk for major complications and death immediately after general and vascular surgery. We assessed whether the Surgical Apgar Score provides accurate risk stratification for hip and knee arthroplasties and whether the score captures the intraoperative variables contributing to postoperative risk for major complications independent of preoperative risk.

There are several limitations to our study. First, unlike the NSQIP in general and vascular surgery, no national, comprehensive, validated outcomes data collection and assessment system exists for hip and knee arthroplasties. We could evaluate only the in-hospital complications reliably available in our hospital’s electronic medical record. However, in-hospital complications reportedly account for greater than 90% of 30-day complications for arthroplasties [14], so this difference alone is unlikely to account for the large discrepancy between predicted and observed complication rates when comparing the score’s performance in our cohort with the original general and vascular surgery cohort. Second, the lack of full-time research nurses abstracting outcome data from full chart reviews of individual patients (as is performed for the NSQIP) precluded our obtaining the same levels of comprehensive abstraction of outcomes. However, unlike minor complications (eg, urinary tract infections), major complications and death can be expected to be documented in discharge summaries, operative notes, and ICD coding [3, 21]. Third, this is a retrospective and not a prospective data collection. For assessing the utility of the score, a prospective collection would be required with followup regarding the potential impact of the score on outcomes.

The score on its own does not provide sufficient discrimination between patients who will or will not have complications to be used as a comprehensive risk stratification tool for clinical triage. Only 6.1% of patients with major complications fall into the high-risk category with a score of 4 or less. Most patients with complications (75.8%) fall into the low-risk group with a score of 7 or more. Thus, for patients undergoing joint arthroplasty, the score cannot supplant physicians’ judgment in determining whether a given patient merits added monitoring or other measures taken when complications are a concern. It achieved a c-statistic of only 0.61, a modest discrimination. This is substantially less compared with the general/vascular surgery validation cohort (c = 0.73) [17]. It is not surprising discrimination was poorer in this population than in the previous validation cohorts, given the important differences in the patients and operations included in the samples. The performance of prognostic models typically degrades somewhat in new populations, particularly populations that differ considerably from the derivation cohort [11].

However, even though the score on its own is not sufficient to be used as a comprehensive risk stratification tool, we found the score captures meaningful prognostic information based on intraoperative variables in patients undergoing an arthroplasty. In combination with other potential preoperative or postoperative risk factors, some of the score’s variables may serve as a foundation for a comprehensive score. Although this arthroplasty cohort was substantially different in many important dimensions from the derivation cohort of general and vascular surgery patients in the NSQIP and had an overall complication rate for arthroplasty less than 1/3 that of the NSQIP cohorts (5.6% versus 22%), similarities in trends of the likelihood ratios support the notion that the relationship between the score and overall relative risk for major complications is similar for arthroplasty compared with general/vascular surgery. Additionally, the score appears to describe risk information of patients based on intraoperative variables that is at least partly independent from preoperative risk factors, indicating the score may capture aspects of the relative performance in the operating room, where more than 2/3 of surgical adverse events occur [9, 10, 20]. Patients with low scores should be recognized to be at an elevated risk for major complications, even patients whose baseline characteristics would suggest they otherwise are at low risk. Characteristics of patients with and without major complications in our cohort are comparable to those reported previously [14–16], confirming our cohort is representative. Although the score does not provide comprehensive risk stratification, it has several potential applications in patients undergoing hip and knee arthroplasties, which should be evaluated in additional studies. The score still may provide a simple summary of how an operation went, thereby supporting postoperative communication.
So far, there has been no consistent way to communicate with the team managing the patient postoperatively regarding how a patient did during surgery. The data suggest the score does provide a simple, objective, easily communicated indicator of the magnitude of the procedure. The clear association of higher scores with lower complication rates suggests future studies should evaluate whether the score provides a useful target end point for safety initiatives to improve operative management in arthroplasty. As with the Apgar score in obstetrics [1, 4, 6, 7], another setting with low risk of complications and a score that comprises at best moderate discriminatory ability but nonetheless a strong indication of relative risk of early complications, the Surgical Apgar Score could provide practical information. If a reduction in the percentage of patients with poor scores could produce a reduction in the number of complications in a given setting, high score levels may be a useful modifiable target for quality improvement programs. The score could be an important internal benchmark for departments to assess relative operative performance with time.

This simple and pragmatic surgical outcome score captured relevant intraoperative prognostic information, integrating components of patient susceptibility and operative performance distinct from standard risk adjustment, but overall is only modestly predictive of patient-specific complication risk in this relatively low-risk cohort. Future studies should focus on modifications to the score that would improve its predictive performance in this low-risk population. Based on our findings, postoperative risk prediction for hip and knee arthroplasties may require different outcome measures compared with general and vascular surgery. There has been an increased effort in building local or national total joint registries for improving quality and safety. One of the many challenges has been to find the most relevant outcome measures to be included. NSQIP has been proposed as a model for orthopaedics [2]. Our results indicate other outcome measures in addition to or in substitution of the ones collected for the NSQIP may be of importance in arthroplasty.