Introduction

Gastric cancer is the fourth most common cancer worldwide. In Japan, the most recent age-standardized mortality rates reported by the International Agency for Research on Cancer are 20.5 % in males and 8.0 % in females. In Korea, these rates are lower: 13.8 % in males and 7.3 % in females [1]. In China, the mortality rate due to gastric cancer is 40.8/100,000 in males and 18.6/100,000 in females.

The decision to perform surgery or administer chemoradiotherapy is based on cancer stage at presentation; comprehensive treatment centered on surgery is considered the standard. Outcomes depend not only on surgical expertise, but also on patient physiology, tumor pathology, and nutritional status. Early prognostic assessment is therefore essential, and predicting morbidity and mortality is critical in deciding the timing and type of surgery.

Several scoring systems currently exist for prediction of morbidity and mortality throughout surgical disciplines. Of these, APACHE II, POSSUM, and P-POSSUM have been shown to be the most reliable and useful. APACHE II has been validated in several clinical trials [2], and estimates ICU mortality based upon laboratory values and patient characteristics, including acute and chronic disease. POSSUM, which was introduced in 1991 [3], was found to overestimate mortality risk; in 1996, P-POSSUM was developed, an equation which used similar physiological and operative factors but incorporated linear analysis [4].

There is an abundance of studies that address surgical morbidity and mortality. However, these data are inconclusive for patients of Asia–Pacific descent; racial and regional differences may exist. For example, when comparing English and American patients, the mortality rate predicted by POSSUM was higher than the actual mortality rate in American patients [5]. To date, few authors have evaluated gastric cancer in China using all three scoring systems. We analyzed 851 patients who presented with gastric cancer between 1991 and 2011, and found P-POSSUM to be most accurate. In addition, type of surgery independently predicts morbidity and mortality.

Patients and methods

Data collection and processing

Through an independent review by two groups of researchers specializing in gastric cancer, all clinical data were collected from 851 patients (583 male, 268 female; ratio 2.18:1) who presented between 1991 and 2011 at Huashan Hospital, China. Patients who were not deemed surgical candidates, or who declined surgery, were excluded from the analysis.

All patients underwent gastric resection and pathological staging according to current NCCN guidelines by our institutional Department of Pathology. Patient demographic data are presented in Table 1, and were obtained from patient medical records and nursing notes. Operative data were obtained from operative, pathological, and anesthetic records, with 30-day morbidity and mortality considered to be standard entry information. All data were entered in spreadsheet format using Microsoft Excel 2007. Statistical analysis was performed using STATA version 11.0.

Table 1 Data classification method, and the number of each classification

Scoring systems

APACHE II consists of three parts: acute physiology score (APS), age, and chronic physiology score (CPS). APS includes 12 parameters, ranging from 0 to 4 points, with a total score of 0–60 points. Age scores range from 0 to 6 points. CPS scores range from 2 to 5 points. In total, the APACHE II score ranges from 0 to 71 points.

APACHE II uses the following formula to calculate the risk of death (R1): ln[R1/(1 − R1)] = −3.517 + (APACHE II score × 0.146) + (0.603, if an emergency operation) + (CPS).
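As a minimal sketch of how the formula above turns a score into a probability, the following Python function inverts the logit. The function name and the `chronic_weight` argument are illustrative assumptions; in particular, the text does not specify how the (CPS) term is coded, so it defaults to zero here.

```python
import math


def apache_ii_risk(score, emergency=False, chronic_weight=0.0):
    """Predicted risk of death R1 from the APACHE II logit equation.

    chronic_weight stands in for the (CPS) term; its coding is an
    assumption of this sketch, not something stated in the text.
    """
    logit = -3.517 + 0.146 * score
    if emergency:
        logit += 0.603  # additional weight for an emergency operation
    logit += chronic_weight
    # Invert ln(R1 / (1 - R1)) = logit to obtain R1.
    return 1.0 / (1.0 + math.exp(-logit))
```

Because the logit is linear in the score, each additional APACHE II point multiplies the odds of death by exp(0.146), regardless of the starting score.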

The POSSUM score has two main components, physiological and operative severity, which are calculated to predict postoperative prognosis. We used POSSUM as originally defined by Copeland et al. [3], and P-POSSUM as defined by Whiteley et al. [4].

The following equations were used to calculate morbidity and mortality for POSSUM (R2 = mortality, R3 = morbidity).

$$ {\text{Log}}_{\text{e}} \frac{R2}{(1 - R2)} = - 7.04 + (0.13 \times {\text{physiological\_score}}) + (0.16 \times {\text{operative\_severity\_score}}) $$
$$ {\text{Log}}_{\text{e}} \frac{R3}{(1 - R3)} = - 5.91 + (0.16 \times {\text{physiological\_score}}) + (0.19 \times {\text{operative\_severity\_score}}) $$

We used the following equation to calculate mortality for P-POSSUM (R4 = risk of mortality).

$$ {\text{Log}}_{\text{e}} \frac{R4}{(1 - R4)} = - 9.065 + (0.1692 \times {\text{physiological\_score}}) + (0.1550 \times {\text{operative\_severity\_score}}). $$
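The three equations above share one form: a linear predictor in the physiological and operative severity scores, mapped to a probability by the inverse logit. A minimal Python sketch follows; the function names are illustrative and not taken from the original publications.

```python
import math


def _inverse_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def possum_mortality(physiological_score, operative_severity_score):
    """R2: POSSUM predicted mortality risk."""
    return _inverse_logit(-7.04 + 0.13 * physiological_score
                          + 0.16 * operative_severity_score)


def possum_morbidity(physiological_score, operative_severity_score):
    """R3: POSSUM predicted morbidity risk."""
    return _inverse_logit(-5.91 + 0.16 * physiological_score
                          + 0.19 * operative_severity_score)


def p_possum_mortality(physiological_score, operative_severity_score):
    """R4: P-POSSUM predicted mortality risk."""
    return _inverse_logit(-9.065 + 0.1692 * physiological_score
                          + 0.1550 * operative_severity_score)
```

For the same pair of low scores, `p_possum_mortality` returns a smaller risk than `possum_mortality`, mainly because of its more negative intercept; this is consistent with P-POSSUM having been introduced to correct the overestimation by POSSUM.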

Evaluation of scoring systems

We calculated each patient’s score according to each of the three described systems. We generated a mortality observed/expected (O/E) ratio and a morbidity O/E ratio for each scoring system. Receiver operating characteristic (ROC) curves were generated to demonstrate the sensitivity and specificity of morbidity as predicted by POSSUM. The area under the curve (AUC) was used as the measure of accuracy. Regression analysis was used to evaluate the effectiveness of each scoring system.
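The two evaluation measures described above can be sketched in a few lines of Python. This is an illustrative implementation only, not the STATA procedure used in the study: the O/E ratio divides the number of observed events by the sum of predicted risks, and the AUC is computed as the probability that a randomly chosen patient with the event received a higher predicted risk than one without it (ties count as one half).

```python
def observed_expected_ratio(outcomes, predicted_risks):
    """O/E ratio: observed events divided by the sum of predicted risks."""
    expected = sum(predicted_risks)
    if expected == 0:
        raise ValueError("expected number of events is zero")
    return sum(outcomes) / expected


def roc_auc(outcomes, predicted_risks):
    """Area under the ROC curve via pairwise comparison of predicted risks."""
    with_event = [p for y, p in zip(outcomes, predicted_risks) if y == 1]
    without_event = [p for y, p in zip(outcomes, predicted_risks) if y == 0]
    if not with_event or not without_event:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for p_event in with_event:
        for p_none in without_event:
            if p_event > p_none:
                concordant += 1.0
            elif p_event == p_none:
                concordant += 0.5  # ties contribute one half
    return concordant / (len(with_event) * len(without_event))
```

An O/E ratio below 1 means the scoring system predicted more events than occurred (overestimation); an AUC of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination.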

Statistical methods

Student’s t test was used to compare categorical and continuous variables, and to analyze both observed and predicted outcome measures for each scoring system. A p value of less than 0.05 was considered statistically significant. Statistical analysis was performed using STATA software, version 11.0.

Results

Mortality rates as predicted by APACHE II, POSSUM, and P-POSSUM scoring systems

Both APACHE II and POSSUM estimations of preoperative mortality risk were generally high. However, POSSUM had an acceptable specificity (false-negative) rate. P-POSSUM overestimated mortality risk, but was the most accurate of the three systems, with a false-negative rate of 0 (Table 2).

Table 2 O/E ratios of three scoring systems for postoperative mortality

ROC analysis

We calculated each patient's score with the three forecasting methods from the clinical information we collected, and compared the predicted values with the actual outcomes. Using regression analysis, the three scoring systems were found to differ with regard to mortality prediction (Fig. 1a–c). P-POSSUM was the most effective, followed by POSSUM and APACHE II.

Fig. 1

Predicting mortality with three scoring systems. a Data analyzed by APACHE II; the AUC is 0.8708. b Data analyzed by POSSUM; the AUC is 0.9164. c Data analyzed by P-POSSUM; the AUC is 0.9253. d Predicting morbidity of 851 patients with P-POSSUM; the AUC is 0.6064. Confidence intervals are listed

POSSUM predicted morbidity rates

POSSUM was found to have a higher predicted than actual morbidity risk, although it was more accurate in patients with N3 disease or who required combined organ resection. However, in patients aged under 50, those with N1 and N2 disease, Stage III patients, and those who underwent total gastrectomy and exploratory laparotomy, POSSUM exaggerated morbidity risk (Fig. 1d).

Logistic regression analysis

Mortality

We established the following regression equation to predict mortality and tested its goodness of fit with the Hosmer–Lemeshow method. The p value was 0.1315, indicating an acceptable goodness of fit (no significant lack of fit). Because the independent variables were correlated, we chose backward stepwise elimination for variable selection. Age and surgical approach were the last two independent variables remaining in the model. With death as the dependent variable, we fitted a logistic regression with gender, age, and surgical approach as independent variables. Because there were no deaths in the distal gastrectomy group in any of the three age groups, and few deaths in the under-50 age group, we first selected the over-50 age groups and the other four surgical approach groups for analysis. We considered only the main effects of the independent variables, as there was no interaction between age and surgical approach. We confirmed that gender had no effect on whether death occurred (p = 0.45). With gender and surgical approach held constant, the risk of death in the over-70 age group was 8.89 (95 % confidence interval 3.21, 24.63) times higher than in the 50–70-year-old group. With gender and age held constant, the risk of death from combined organ resection was 3.79 (95 % confidence interval 1.10, 13.04) times higher than the risk from distal gastrectomy, and the risk of death from exploratory laparotomy was 6.73 (95 % confidence interval 2.31, 19.63) times higher than the risk from distal gastrectomy. There was no significant difference between the other surgical methods. We then confirmed there was no significant difference in the risk of death between the different surgical methods in the over-50 age group (Fig. 2a).
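The reported ratios and confidence intervals can be checked for internal consistency. The sketch below assumes, as is usual for logistic regression output but not stated explicitly in the text, that each ratio is an odds ratio with a Wald 95 % interval, so the interval is symmetric on the log scale. Under that assumption the geometric mean of the two limits should reproduce the point estimate, and the standard error of the log odds ratio can be recovered from the interval width. The function name is illustrative.

```python
import math


def log_scale_summary(ci_low, ci_high, z=1.96):
    """Recover the point estimate and log-scale SE implied by a Wald interval.

    Assumes the interval was built as exp(beta -/+ z * SE).
    """
    beta = (math.log(ci_low) + math.log(ci_high)) / 2.0
    standard_error = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z)
    return math.exp(beta), standard_error


# Intervals reported above for the mortality model.
implied_age_ratio, age_se = log_scale_summary(3.21, 24.63)
implied_combined_ratio, combined_se = log_scale_summary(1.10, 13.04)
implied_laparotomy_ratio, laparotomy_se = log_scale_summary(2.31, 19.63)
```

The implied point estimates agree with the reported values of 8.89, 3.79, and 6.73 to within rounding, which supports reading the intervals as log-symmetric.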

Fig. 2

ROC curves of mortality and morbidity established by logistic regression analysis. a AUC is 0.8301. Cutoff point is 0.0620. Positive likelihood ratio (LR+) is 5.4751. Negative likelihood ratio (LR−) is 0.3669. b AUC is 0.6533. Cutoff point is 0.3155. LR+ is 1.7275. LR− is 0.6703

$$ P^{\prime} = 0.0813 \times {\text{age}} - 0.4239 \times T + 0.2840 \times N - 1.1935 \times M + 0.3978 \times {\text{stage}} + 0.4555 \times {\text{po}} + 0.6955 \times {\text{oper}} - 10.8958 $$
$$ R^{\prime} = 1/(1 + {\text{e}}^{-P^{\prime}}). $$

Morbidity

We established the following regression equation to predict morbidity and tested its goodness of fit with the Hosmer–Lemeshow method. The p value was 0.1064, indicating an acceptable goodness of fit (no significant lack of fit). We chose backward stepwise elimination for variable selection. Age and surgical approach were the last two independent variables remaining in the model. With surgical approach held constant, the risk of complication in the over-70 age group was 3.91 (95 % confidence interval 2.43, 6.30) times higher than in the under-50 age group, and 2.22 (95 % confidence interval 1.57, 3.15) times higher than in the 50- to 70-year-old group. With surgical approach as the variable, the risk of complication in the exploratory laparotomy group was 3.12 (95 % confidence interval 1.29, 7.56) times lower than in the proximal gastrectomy group, 2.20 (95 % confidence interval 1.28, 3.80) times lower than in the distal gastrectomy group, 2.37 (95 % confidence interval 1.25, 4.49) times lower than in the total gastrectomy group, and 2.36 (95 % confidence interval 1.22, 4.58) times lower than in the combined organ resection group. There was no significant difference between the other surgical methods (Fig. 2b).

$$ P^{\prime} = 0.0405 \times {\text{age}} - 0.3206 \times T + 0.0318 \times N - 0.2392 \times M + 0.3463 \times {\text{stage}} + 0.2054 \times {\text{po}} + 0.0932 \times {\text{oper}} - 3.0223 $$
$$ R^{\prime} = 1/(1 + {\text{e}}^{-P^{\prime}}). $$
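The two fitted equations (mortality and morbidity) can be applied to an individual patient as sketched below. The coefficients are copied from the equations; everything else is an assumption of this sketch. In particular, the text does not define how T, N, M, stage, po, and oper are numerically coded, so the patient values shown are hypothetical placeholders rather than a validated encoding.

```python
import math

# Coefficients copied from the fitted equations; "intercept" is the constant term.
MORTALITY = {"age": 0.0813, "T": -0.4239, "N": 0.2840, "M": -1.1935,
             "stage": 0.3978, "po": 0.4555, "oper": 0.6955,
             "intercept": -10.8958}
MORBIDITY = {"age": 0.0405, "T": -0.3206, "N": 0.0318, "M": -0.2392,
             "stage": 0.3463, "po": 0.2054, "oper": 0.0932,
             "intercept": -3.0223}


def predicted_risk(coefficients, covariates):
    """Compute R' = 1 / (1 + exp(-P')) for one patient.

    Raises KeyError if any covariate required by the model is missing.
    """
    p_prime = coefficients["intercept"]
    for name, weight in coefficients.items():
        if name != "intercept":
            p_prime += weight * covariates[name]
    return 1.0 / (1.0 + math.exp(-p_prime))


# Hypothetical patient; the numeric codes are assumptions, not the study's coding.
patient = {"age": 72, "T": 3, "N": 1, "M": 0, "stage": 3, "po": 1, "oper": 2}
mortality_risk = predicted_risk(MORTALITY, patient)
morbidity_risk = predicted_risk(MORBIDITY, patient)
```

Keeping the coefficients in dictionaries keyed by variable name makes it harder to apply a weight to the wrong covariate than with positional arguments, and a missing covariate fails loudly instead of silently contributing zero.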

Discussion

Audit is becoming increasingly important in clinical practice and has gained attention as a tool for clinical and scientific research. APACHE II and POSSUM are audit tools used in general surgery and several other surgical fields. In general surgery, research has focused on risk prediction in colorectal cancer. However, few studies predict risk in gastric cancer patients, especially in Chinese and other Asian populations. Risk prediction models may provide a common platform for comparing operative risk across different diseases, hospitals, and regions.

APACHE II is based upon objective physiological parameters and is less affected by treatment. Several studies report that APACHE II is closely related to disease severity; the higher the score, the more severe the disease and the higher the risk of death. An APACHE II score greater than 20 is associated with a 100 % mortality rate [6]; analysis of this research provides an overall mortality distribution.

APACHE II is quite limited, as it was designed for the ICU and is more predictive of group than of individual outcome. Van Le et al. [7] confirmed that the sensitivity of the APACHE II score in the individual was only 54 %, although it was used to assess 45 gynecological patients in the ICU. Other authors considered APACHE II to be ineffective in evaluating patients under special circumstances, such as trauma, TPN, acute myocardial infarction, and severe congestive heart failure [8, 9]. In these situations, predicted mortality was always higher than actual mortality when the score was low, and always lower than actual mortality when the score was high. As these articles reported, APACHE II parameters were used within 24 h after ICU admission; Knaus et al. [2] believed that parameters sampled during ICU admission would be more meaningful. Other authors suggest that dynamic ratings (assessing patients over several days) may be more effective in predicting mortality. APACHE II is a general rating system that does not consider pathology, although pathology has a great influence on the general health of the patient. APACHE II was also designed by Western scientists; thus, some regional differences may exist when it is applied in Asian countries.

Most scoring systems reflect vital signs and vital organ function. Certain indicators, such as the Glasgow Coma Scale, hold no special significance for patients with gastric cancer; this type of surgery is generally elective and those with poor vital signs will not be subjected to surgery. Conversely, chronic disease that may affect prognosis and physiological indicators are not included in the statistics.

In many European countries, POSSUM, one of the most widely used scoring methods, has become routine in the preoperative evaluation of patients. It not only evaluates the risk of surgery, but also provides an objective basis for choosing surgical options and strengthening interventions in high-risk patients. Lamb used POSSUM to demonstrate that controlling the extent of surgery, and balancing the risk of surgery and extent of dissection, reduced mortality in 180 gastric cancer patients [10]. Hartley and Sagar [11] and Sah et al. [12] showed the excellent predictive value of POSSUM through a retrospective analysis of gastrointestinal surgery patients. Similar results were shown for POSSUM by Mosquera et al. (454 vascular surgery patients) [13], Tambyraja et al. (laparoscopic cholecystectomy) [14], and Bennett-Guerrero et al. (liver transplantation). However, in certain situations, this system can be problematic, such as overpredicting morbidity in low-risk groups [15]. Das et al. [16] analyzed 468 gynecological tumors using POSSUM, and found that predictions of operative mortality and the actual outcomes were inconsistent. Copeland et al. [3] found that the results of POSSUM were satisfactory for general surgery, but needed improvement for use in orthopedic surgery. In this case, when the POSSUM score was less than 40 %, the predicted mortality rate was greater than 100 % of the actual outcome. However, when the POSSUM score was greater than 60 %, the number of deaths was underestimated. The false-negative rate of POSSUM was low. The actual number of deaths in each group was within the range of predicted values.

P-POSSUM has the best predictive value of all three scoring methods, and is an improvement over the POSSUM scoring system. P-POSSUM has satisfactory results not only in the general population, but also among specific cohorts. Prytherch et al. [17] found, through the analysis of 10,000 general surgery patients, that the ability of the P-POSSUM scoring system to accurately predict mortality was better than that of POSSUM. Jones et al. then compared surgical risk prediction in individual surgical ICU patients using POSSUM and APACHE II, with ROC curves for statistical analysis (AUC, POSSUM = 0.75; AUC, POSSUM morbidity = 0.82; AUC, APACHE II = 0.54) [18]. Thus, the ability of POSSUM to accurately predict outcome is significantly higher than that of APACHE II. De Cassia Braga Ribiero and Kowalski analyzed postoperative morbidity and mortality after pharyngeal cancer surgery, using the APACHE II, POSSUM, and ASA scoring systems [19]. The predictive value was graded as follows: POSSUM > APACHE II > ASA. In this case, POSSUM and P-POSSUM were better able to predict mortality, but POSSUM was weak in predicting complications.

In statistical analysis, POSSUM has several drawbacks. Electrocardiogram grades do not accurately reflect patient risk. Because of the increased likelihood of ventricular arrhythmias, myocardial ischemia impacts both morbidity and mortality rates, but these are not reflected in the ECG grading standards in the POSSUM scoring system. Subjective elements easily cause bias. Intraoperative blood loss and abdominal contamination cannot be precisely determined. The surgical acuity score therefore has too many subjective factors. Some patients have several physiological indicators corrected before surgery, such as anemia, malnutrition, and electrolyte balance disorders. Some authors therefore suggest that risk assessment for gastrointestinal surgery should be conducted at two points: 24 h after admission, and postoperative day one. The first assessment is to determine whether the patient will tolerate surgery and to determine the timing and type of operation. The second assessment is to predict morbidity and mortality and to guide further treatment.

With logistic regression analysis, we found only age and type of surgery to be independent predictors of patient death. Similarly, age, characteristics of the primary tumor, and type of surgery were independent predictors of complications. Although TNM stage is an independent prognostic factor in gastric cancer, with survival decreasing as stage increases, we did not find TNM staging to be an independent predictor of postoperative mortality.

Selecting an operation that may carry an increased risk of morbidity and mortality is a critical issue for surgeons. Western patients often have obesity and circulatory and pulmonary comorbidities; thus, European and American surgeons seldom perform D2 radical gastric cancer resection. In contrast, D2 radical gastric cancer resection has long been considered a standard surgical procedure in Japan. Combined gastrectomy often includes the spleen, pancreas, colon, and duodenum. Although some of these radical procedures do not demonstrate evidence of increased long-term survival, a more aggressive operation may significantly impact postoperative morbidity and mortality rates. Therefore, surgeons should aim for a radical cure while keeping operative risk as controllable as possible. On the other hand, as combined organ resection does not carry a significant mortality risk in patients with advanced disease, surgeons should, if the tumor is resectable, aim for as complete an operation as possible to reduce tumor burden.

Conclusion

As gastrointestinal surgery advances, selecting a personalized treatment based on patient condition is becoming increasingly important. Preoperative assessment for Chinese patients needs improvement. The impact of gastric cancer on patients with perioperative risk needs to be balanced with current scoring systems so that an appropriate formula can be used for Chinese patients. Clinicians may develop individualized treatment plans according to these formulas, selecting the appropriate surgical approach, enhancing preoperative organ protection, and optimizing postoperative treatment.