Acute appendicitis (AA) is a common surgical condition of the abdomen, the prompt diagnosis of which is rewarded by a marked decrease in morbidity and mortality [1]. Although the decision to explore a patient with suspected AA is based mainly on disease history and physical findings, the clinical presentation is seldom typical. Therefore diagnostic errors are common, resulting in a median incidence of perforation of 20% and a negative laparotomy rate ranging from 2% to 30% [1].

During the last 20 years, there has been a growing trend toward the use of formal probabilistic reasoning or quantitative data as a guide to clinical decision making. In this respect, several scoring systems, computer-based models, and algorithms [2–12] have been developed for supporting the diagnosis of AA on the basis of grading medical history, clinical symptoms and signs, and indicators of inflammatory response. According to initial evaluation reports, these decision tools are cost-effective and may provide considerable diagnostic aids to physicians [13]. Nevertheless, the aforementioned models have not been routinely applied in general practice because they have failed to achieve adequate accuracy in validation studies [14–17].

Accumulating evidence has suggested that ultrasonography (US) in experienced hands improves diagnostic accuracy in cases of suspected AA [18, 19]. Thus, sonographic imaging has been proposed as a diagnostic tool even in patients with a clinically high probability of AA, because it accurately depicts a high percentage of normal appendices and alternative diagnoses [20]. However, these findings do not imply that surgeons may dispense with their clinical acumen in the management of subjects with suspected AA, inasmuch as series with false-negative sonographic rates of up to 24% have been reported [21]. Furthermore, only scant data exist on the potential combination of US findings with clinical and laboratory variables as an integrated decision tool [22].

The aims of the present study were (1) to develop a simple and reliable scoring system that would incorporate US assessment and particular elements of clinical evaluation and laboratory investigation to provide high diagnostic accuracy in patients with suspected AA and (2) to evaluate the performance of the derived classification rule as compared to that of previously proposed models in an independent database of subjects with suspected AA.

Patients and Methods

The present investigation included a total of 504 subjects with suspected AA who were studied in two distinct phases: (1) an internal development study (303 patients) and (2) an external validation study (201 subjects). Both studies were observational, and no intervention was made except for the addition of formalized data collection.

Internal Development Study

The internal study was a prospective analysis conducted between July 1998 and July 2001. Within 1 hour of the clinical and laboratory assessment, all patients admitted with clinically suspected AA underwent sonographic examination of the appendix and the abdomen by a staff radiologist who was blinded to the results of the physical examination and the blood tests, but not to the patient’s symptoms. Ultrasound was performed with commercially available high-quality equipment (HDI 3000 unit; Advanced Technology Laboratories, Walpole, MA) with 2–5 MHz curved-array and 5–10 MHz linear-array transducers. Well-established ultrasonographic criteria were applied to discriminate an acutely inflamed appendix from a normal one [19]. All female patients had pelvic examinations. The diagnosis of AA was made exclusively on histopathological grounds by the local pathologist according to previously described standardized criteria [16].

External Validation Study

From August 2001 to August 2003 the score derived from the internal study was applied to an independent database including the next consecutive patients hospitalized for suspected AA. Subsequently, the performance of the score in the above database was compared to that of 11 previously proposed diagnostic scores for AA, which were also calculated by using data from the population of the external study [2–12]. The selection criteria regarding the aforementioned diagnostic scores for AA were (1) development of each score from patients presenting with acute abdominal pain, (2) previous validation in at least one prospective study [15], and (3) feasibility of each score calculation (namely no missing variables) on the basis of the data prospectively collected in our external validation study by using a structured form that included a standardized questionnaire. Because the goal of the present study was to compare the new model with the numerous previous ones, application of the new score to the external study in order to reduce the negative appendectomy rate was not possible without biasing the results. Hence, no score-based intervention took place, and the decision to operate or not was left to the judgment of the senior surgeon, who was not aware of the conclusion of each model for every individual subject. The diagnosis of “non-appendicitis condition” in non-operated participants of both the internal and the external study was supported by telephone surveillance for at least 1 year (median 27 [12–60] months).

Statistical Analysis

Statistical analysis was performed using the Statistical Package for the Social Sciences software (SPSS Inc, Chicago, IL, release 10.0). Acute appendicitis at operation was used as the end point in the internal study. Univariate correlations between the presence of the aforementioned end point and clinical or laboratory features were evaluated with the chi-squared test or Fisher’s exact test, as appropriate for categorical data, and with Student’s t-test or one-way analysis of variance (ANOVA) for continuous variables. A forward stepwise logistic regression procedure, with entry and removal criteria of p = 0.05 and p = 0.10, respectively, was used to identify independent predictors of AA. Ninety-five percent confidence intervals (95% CIs) were calculated for each comparison. All tests of significance were two-tailed, and a p value less than 0.05 was considered to be significant. In the external study, the diagnostic performance of each scoring system was tested to define risk groups, reflecting the varying likelihood of AA in an independent population. Cohen’s kappa statistic was calculated for assessing the agreement between scores. Areas under the receiver-operating characteristic (ROC) curves were used to describe the diagnostic performance of each score. The Hosmer-Lemeshow test was used to assess the fit of the models.
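The two agreement and discrimination measures named above can be illustrated with a minimal Python sketch. The study itself used SPSS, so this is not the authors' implementation; the function names `cohens_kappa` and `auc` are hypothetical, and the AUC is computed here through its rank-based (Mann-Whitney) interpretation, which is one common way to obtain it.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of subjects classified identically.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the marginal frequencies, summed over labels.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def auc(scores, has_disease):
    """Area under the ROC curve as P(score of a case > score of a non-case),
    counting ties as one half (Mann-Whitney formulation)."""
    cases = [s for s, d in zip(scores, has_disease) if d]
    controls = [s for s, d in zip(scores, has_disease) if not d]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))
```

For example, `cohens_kappa` would be called with the dichotomized verdicts (appendicitis or not) of two scores for the same patients, and `auc` with the raw score values and the final diagnoses.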

Results

Internal Development Study

Among 318 patients admitted with suspected appendicitis, complete data were available for 303 subjects (170 males, 56.1%) who were finally included in the analysis. Participants’ mean age was 28.3 ± 13.3 years; 161 patients (53.1%) went on to surgery and 130 (42.9%) had AA at operation. Non-operated subjects were assumed not to have AA, because none of them developed appendicitis during follow-up. Table 1 shows patient characteristics as well as univariate correlates of AA at operation. The final diagnosis for all operated and non-operated subjects in the internal study is presented in Table 2. The negative appendectomy rate was 19.2% (31 out of 161 operated patients). Four independent predictors of the presence of AA were identified in the logistic regression analysis (Table 3). The coefficients (parameter estimates) of the above four factors, multiplied by 2 and rounded to the nearest integer, allowed for a simpler re-expression of the final regression model as an integer-based scoring system, which assigned a weight (point) to each predictor and summed the weights of the predictors that were present for a subject: [number of points = 6 for US positive for AA + 4 for tenderness in right lower quadrant + 3 for rebound tenderness + 2 for leukocyte count > 12,000/μl]. None of the 22 patients (7.3% of total) who were in the subgroup with the lowest score (0–4 points) had AA, whereas in 126 (96.2%) of the patients with the highest score (8–15 points; n = 131 [43.2% of total]), AA was the final diagnosis. Nevertheless, the proportion of subjects with AA among patients with moderate scores (5–7 points; n = 150 [49.5% of total]) was very small (4 out of 150, 2.7%). Thus, using the cut-off of ≥ 8 points for the diagnosis of AA in the internal study, a very high probability of AA would have been assigned to subjects with 8–15 points (96.2%, 126/131) as opposed to the very low probability for patients with 0–7 points (2.3%, 4/172).
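The integer-based rule described above can be written out as a short Python sketch. The weights, the leukocyte threshold, the score bands (0–4, 5–7, 8–15), and the cut-off of 8 points are taken from the text; the function names, argument names, and group labels are hypothetical and chosen only for illustration.

```python
LEUKOCYTE_THRESHOLD_PER_UL = 12_000  # leukocytosis: count strictly above this
CUTOFF = 8                           # score >= 8 suggests acute appendicitis


def appendicitis_score(us_positive, rlq_tenderness, rebound_tenderness,
                       leukocyte_count_per_ul):
    """Sum the weights of the predictors that are present (range 0 to 15)."""
    score = 0
    if us_positive:
        score += 6  # ultrasound positive for acute appendicitis
    if rlq_tenderness:
        score += 4  # tenderness in the right lower quadrant
    if rebound_tenderness:
        score += 3  # rebound tenderness
    if leukocyte_count_per_ul > LEUKOCYTE_THRESHOLD_PER_UL:
        score += 2  # leukocyte count > 12,000 per microliter
    return score


def risk_group(score):
    """Map a score to the three bands reported in the internal study."""
    if not 0 <= score <= 15:
        raise ValueError("score must lie between 0 and 15")
    if score <= 4:
        return "low"
    if score <= 7:
        return "moderate"
    return "high"


def suggests_appendicitis(score):
    """Apply the proposed cut-off of >= 8 points."""
    return score >= CUTOFF
```

A patient with a positive US and a leukocyte count of 13,000/μl but no tenderness would score 6 + 2 = 8 points and fall in the highest band, whereas a patient with both tenderness signs, a negative US, and a normal leukocyte count would score 4 + 3 = 7 points and remain below the cut-off.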

Table 1 Demographic, clinical, and laboratory characteristics of 303 subjects with suspected appendicitis in the internal study and univariate correlates of acute appendicitis at operation (OR, odds ratio; 95% CI, 95% confidence intervals).
Table 2 Final diagnosis for 303 subjects with suspected appendicitis in the internal study.
Table 3 Multivariate analysis of predictors of acute appendicitis at operation in the internal study (OR, odds ratio; 95% CI, 95% confidence intervals).

External Validation Study

The above diagnostic score was calculated for the next 201 patients (105 [52.2%] males, mean age 28.7 ± 11.9 years [range, 15–79 years]) hospitalized for suspected AA. Among the above subjects, 109 (54.2%) went on to surgery and 87 (43.3%) had AA at operation. No significant difference was observed between the populations of the internal study and the external study in the overall frequency of the four above-mentioned independent predictors, as well as in terms of the remaining clinical and demographic characteristics. The application of the new classification tool to the external database showed 96.5% of subjects with 8–15 points to have AA (Table 4). The proposed diagnostic model yielded a score of < 8 points for all 92 non-operated patients in the external study. The level of agreement of the proposed score as estimated by the kappa statistic was high with Eskelinen and Ohmann scores and moderate to fair with the remaining ones (Table 5). The present model noticeably exceeded the previous ones in diagnostic accuracy (Table 5) as well as in discriminatory capacity as expressed by area under the curve (AUC) (p < 0.001; Table 6, Fig. 1).

Table 4 Performance of the proposed diagnostic score in the external validation study.
Table 5 Comparison of the proposed score with the previous ones.
Table 6 Discriminatory power of the proposed score as well as of 11 previous scores for diagnosis of acute appendicitis expressed as areas under the receiver-operating characteristic curves (95% CI, 95% confidence intervals).
Figure 1 Receiver-operating characteristic curves plotted on the basis of the proposed score as well as of 11 previous scores for diagnosis of acute appendicitis.

Discussion

The model developed in the present study combines the diagnostic value of four variables, namely two well-recognized clinical features of AA (tenderness in the right lower quadrant and rebound tenderness) [1], US imaging, and leukocytosis, the latter reflecting the inflammatory response. The prominence of the aforementioned factors as independent correlates of AA corroborates previous reports, which have shown scores not including the above clinical variables and leukocytosis to provide poorer discrimination [1, 15]. With regard to the varied weighting of the four multivariate predictors, a positive US finding surpassed any other factor by introducing at least a 5.5-fold increase in the probability of AA, as suggested by the 95% CIs (Table 3).

According to the proposed threshold of ≥ 8 points, if the appendix is sonographically shown to be inflamed, the presence of at least one additional factor is required to establish AA, whereas in the absence of US demonstrating AA, all three remaining variables are necessary for the diagnosis. For example, the above model would suggest the diagnosis of AA in a patient with leukocytosis and a positive US finding (total score 8 points), even if rebound or right lower quadrant tenderness were lacking. The application of the new system to the external database yielded an impressive diagnostic accuracy of 96.5%, which noticeably exceeded the performance of previous scores, whereas the comparison of the corresponding AUCs showed a clearly greater discriminatory capacity of the present score (Table 6, Fig. 1), with the 95% CIs excluding an AUC for the proposed model smaller than 0.93. The superiority of the new score could be attributed to the incorporation of an imaging modality in a formal decision tool for AA, which is the novel diagnostic procedure introduced in the present study.
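The statement about which combinations of findings reach the threshold can be checked by brute force over all 16 possible combinations of the four predictors. The following Python sketch uses the weights and cut-off reported in the Results; the dictionary keys and the function name are hypothetical labels introduced only for this illustration.

```python
from itertools import product

# Weights from the internal development study; the key names are illustrative.
WEIGHTS = {
    "us_positive": 6,
    "rlq_tenderness": 4,
    "rebound_tenderness": 3,
    "leukocytosis": 2,
}
CUTOFF = 8


def findings_meeting_cutoff(weights=WEIGHTS, cutoff=CUTOFF):
    """Return (findings present, total score) for every combination of
    predictors whose summed weights reach the cut-off."""
    names = list(weights)
    hits = []
    for present in product([False, True], repeat=len(names)):
        findings = tuple(n for n, p in zip(names, present) if p)
        total = sum(weights[n] for n in findings)
        if total >= cutoff:
            hits.append((findings, total))
    return hits


if __name__ == "__main__":
    for findings, total in findings_meeting_cutoff():
        print(total, ", ".join(findings))
```

Under these weights, a positive US alone gives 6 points and stays below the cut-off, any single additional finding lifts it to at least 8, and without a positive US only the combination of all three remaining findings (4 + 3 + 2 = 9 points) qualifies.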

Although sonographic imaging of the abdomen has been established as a useful tool in the diagnosis of AA, being of particular value in patients with atypical presentation [23], its accuracy has been doubted in more recent large studies and meta-analyses [18, 19, 21, 24–26]. In this respect, it has been demonstrated that, when US is used as the determining factor for operative therapy, it cannot be relied on to the exclusion of the surgeon’s careful and repeated evaluation [21]. Furthermore, a prospective multicenter observational trial on 2280 patients with acute abdominal pain reported no correlation between the sonographic findings of the appendix and the diagnostic accuracy of the clinician, the rate of negative appendectomy, and the perforation rates, thus suggesting no clear benefit of US scanning of the appendix in the routine clinical setting [19]. In addition, sonography failed to improve the diagnostic accuracy or the negative appendectomy rate and was even found to delay surgical consultation and appendectomy in a large study that included 766 subjects [24]. Nevertheless, it has been shown that US is unnecessary when there is a high degree of clinical suspicion as expressed by a positive Alvarado score, whereas the additional information provided by US improves diagnostic accuracy in the case of a negative or equivocal Alvarado score [25]. Moreover, a meta-analysis published in the mid-1990s suggested that US is most helpful in patients with an indeterminate probability of the disease after the initial evaluation and should not be used to exclude AA in subjects with classic signs and symptoms because of the underlying relatively high false-negative rate [18]. Finally, a more recent meta-analysis on the value of US in the diagnosis of AA revealed disappointing results in multicenter trials, suggesting that the adequate performance of sonography in single-center studies may not reflect everyday surgical practice [26].

Ultrasound is rapid, noninvasive, inexpensive, and requires no patient preparation or contrast material administration [23]. Because it involves no ionizing radiation and excels in the depiction of acute gynecologic conditions, it is recommended as the initial imaging study in children [27] and in women [28], especially during pregnancy [29]. Yet, the limitations of US include its reduced accuracy in obese or muscular subjects, as well as in patients with perforated AA (approximately 50%) compared to that observed in nonperforated AA (80%) [23]. Furthermore, US is known to be highly operator-dependent, the learning curve required to develop the technique for sonographically scanning the right lower quadrant is considerable, and there are many interpretive pitfalls to be avoided [23]. It has been shown, however, that even if radiology residents or inexperienced surgeons conduct the imaging, the accuracy of US is not diminished [30, 31]. In any case, although the criteria for the US-based diagnosis of AA are well-established and reliable, the inexperienced examiner, working with poor equipment and/or technique, will provide suboptimal results, and this possibility should be taken into account when incorporating sonographic criteria in the diagnostic pattern.

The use of US in the setting of suspected AA might be questioned in an era when appendiceal computed tomography (CT) has been demonstrated to provide an accuracy rate as high as 98% in the diagnosis of AA, leading to improved patient care and reduced use of hospital resources [32]. Moreover, CT has repeatedly been shown to exhibit superior discriminatory capacity compared to US in both adults and adolescents with suspected AA [33–35], suggesting that the proposed classification system may not apply to geographical areas where CT scanning is readily available on a 24-hour basis. In this study, the inability to routinely perform CT scanning may account to a great extent for the relatively high false-positive rate of approximately 20%. This number of false-positive diagnoses would be unacceptable in most Westernized nations, where appropriate CT utilization in community hospitals has been shown to reduce the negative appendectomy rate from 14%–20% to 2%–7% [36–38]. Nevertheless, because many portions of the world health community may still not be able to afford CT scanning but can afford US equipment, the combined systematic implementation of sonographic evaluation and clinical acumen could be valuable as suggested by the present study.

Because the simultaneous application of the preexisting models and the new score to the same database has favored the latter, the respective clinical implications should be further evaluated. A prospective, interventional, large-scale evaluation in different clinical environments, in an adequately controlled study comparing a baseline phase without scoring to a subsequent phase with scoring, would probably be the optimal approach [15, 16]. To reduce bias with such a design, uniform data collection should be carried out according to constant definitions, with standardized performance criteria used to ensure objective evaluation [16].

Any diagnostic support for AA should be warmly welcomed if it has been proven to be clinically valuable, because unacceptably high negative appendectomy and perforation rates are still reported in many portions of the world health community. However, apart from being familiar with elements not included in a quantitative model, physicians may be able to provide superior imputations of missing data for an individual patient and to integrate the diagnostic estimate as part of their overall patient assessment. Therefore, including the proposed score in the diagnostic procedure is worth trying and may enhance a surgeon’s discriminatory capacity, under the prerequisite that it will be considered as an adjunct in decision making that cannot supplant careful surgical judgment.