Introduction

Appendicitis is one of the most common acute surgical illnesses, with a lifetime prevalence of one in seven [1]. It remains clinically challenging to diagnose because it mimics a variety of other pathologies, especially in females [1]. Diagnosis is usually based on the clinical history and examination, correlated with laboratory and imaging investigations. The final diagnosis may require diagnostic laparoscopy, which itself is not without risk.

Clinical prediction rules (CPRs) are one of the most commonly described tools used to aid the diagnosis of appendicitis. CPRs are derived from systematic clinical observations and aim to reduce uncertainty by standardising the collection and interpretation of clinical data [2, 3]. They have been shown to provide a more objective method of assessment and standardisation of care for patients with suspected appendicitis, thereby reducing the number of unnecessary operations and patient exposure to radiation [4]. Although a plethora of CPRs exist for the diagnosis of appendicitis, it is unclear which of these performs most reliably.

The aim of this systematic review was to identify all current CPRs for the diagnosis of appendicitis in adults and assess their performance.

Methods

Search strategy

This study was completed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [5]. A comprehensive literature search was performed in the MEDLINE, EMBASE, PubMed and Cochrane Central Register of Controlled Trials databases from inception to February 2016. The search strategy is outlined in Table 1. Studies were restricted to English language and humans only. The reference lists of all included studies and relevant review articles were also searched to identify further potentially eligible manuscripts.

Table 1 Search terms used

Inclusion and exclusion criteria

Only studies that derived or validated the impact of a CPR for use in adults presenting with right lower quadrant (RLQ) pain, right iliac fossa (RIF) pain or abdominal pain suspicious of appendicitis were included. For the purposes of this study, a CPR was defined as one that [2, 3, 6]:

  • Had three or more predictive variables obtained from the history, physical exam and simple diagnostic tests.

  • Provided a probability of an outcome or suggested a diagnostic/therapeutic course of action.

  • Was not a decision analysis, decision tree or practice guideline.

Both CPR derivation and validation studies were included. A derivation study was defined as a study that described how a new CPR was formed and explained how it should be applied in a clinical setting. A validation study assessed the performance of an existing CPR by ascertaining the sensitivity, specificity and/or area under the curve (AUC). If a derivation study included an internal validation component, that component was excluded from the validation study analysis due to a high risk of potential bias [2].

Exclusion criteria for derivation studies

When assessing articles which derived a CPR, studies that modified an existing scoring system in order to generate a new scoring system were included if the new parameters and cut-off values were clearly defined. There was no restriction on study design. Scores derived for use solely in paediatric, elderly, pregnant or single-gender populations were excluded, as were those that did not assess the primary outcome of appendicitis versus non-appendicitis or that required the use of neural networks.

Exclusion criteria for validation studies

Studies that validated CPRs in elderly populations or in a single gender only, or that included patients younger than 14 years, were excluded. Studies that looked at a subset of the scoring system, or only at patients who had imaging, were also excluded, as were three studies that did not state the age of the participants. However, studies that included patients younger than 14 years of age but provided a separate analysis for adults were included.

Selection of studies

The initial search, title and abstract screen were performed independently by MK and CH. Any discrepancy between the two reviewers was discussed with senior author AM. A total of 224 articles were identified as relevant and underwent full text review by authors MK, CH, ML, WM and LS.

Data extraction and statistical analysis

Derivation studies

Data from studies describing derivation of a CPR were extracted using a standard pro forma. Study characteristics, derivation methodology, scoring system characteristics (e.g. use of weighting, positive versus negative scoring) and variables comprising the CPR were recorded for each study.

Validation studies

Data from CPR validation studies were also extracted using a standard pro forma. These included study design and the results obtained for sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), likelihood ratios and area under the curve (AUC) values from receiver operating characteristic (ROC) curve analysis.

When more than two cut-off values were evaluated for the prediction of high risk of having appendicitis, only the cut-off recommended in the original derivation paper was used for analysis. When sensitivity and specificity were not reported in the validation studies, they were calculated from the available data using a two-by-two table by author MK and confirmed by YT. Forest plot confidence intervals (CIs) were calculated using the variance method for all studies to minimise bias [7].
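The two-by-two calculation described above can be sketched as follows. This is a minimal illustration rather than the authors' own procedure: the function name `diagnostic_indices`, the class `TwoByTwo` and the example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from a two-by-two table (hypothetical helper type)."""
    tp: int  # CPR positive, appendicitis confirmed
    fp: int  # CPR positive, no appendicitis
    fn: int  # CPR negative, appendicitis confirmed
    tn: int  # CPR negative, no appendicitis


def diagnostic_indices(t: TwoByTwo) -> dict:
    """Return the standard diagnostic indices for one two-by-two table."""
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    ppv = t.tp / (t.tp + t.fp)
    npv = t.tn / (t.tn + t.fn)
    # Likelihood ratios are undefined when the denominator is zero.
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
    }


# Illustrative counts only (assumed for the example).
example = TwoByTwo(tp=80, fp=15, fn=20, tn=85)
indices = diagnostic_indices(example)
print({name: round(value, 3) for name, value in indices.items()})
```

Note that specificity and NPV require the number of true negatives, so they cannot be derived from a study that does not follow up patients who were discharged without surgery.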

Assessment of methodological quality of validation studies

The quality of included validation studies was assessed and scored using 15 pre-defined criteria by Wasson et al. (Table 2) [3]. These criteria were specifically designed to assess articles describing clinical prediction rules.

Table 2 Quality assessment criteria for validation studies based on previously defined criteria by Wasson et al. [3]

Results

Study selection

The initial database search identified 7696 titles, and a further 56 were identified through the manual search. Of these, 4398 were potentially relevant after removal of duplicates and further screening. Following abstract review, 257 papers met criteria for full text review. Of these, 12 papers describing derivation of CPRs and 22 describing validation were included. The PRISMA flow diagram is presented in Fig. 1 [5].

Fig. 1 Flow diagram showing systematic inclusion of papers

Derivation studies

Characteristics of CPRs derived for use in adults with suspected appendicitis demonstrated significant heterogeneity in both study population and methodology (Table 3). Among the discrepancies in methodology was the variation in statistical analyses. Three studies used univariate analysis, while seven studies used multivariate analysis (Table 3) [8–16].

Table 3 Characteristics of studies and the clinical prediction rules from derivation studies

The most commonly incorporated variable was the white cell count, which appeared in all 12 studies (Table 4) [8–18]. Temperature, rebound tenderness and migratory pain were the next most common across all studies (Table 4) [8–11, 16–18]. Studies that used multivariate analysis identified gender, elevated C-reactive protein, RIF pain, neutrophilia, vomiting and signs of peritonism (guarding, rigidity) as likely variables [9, 10, 13–16]. Rectal tenderness, diarrhoea and Rovsing’s sign were the least commonly used variables and appeared only in CPRs that used univariate analysis [11, 12, 18].

Table 4 Variables incorporated within CPRs

Validation studies

The 22 included validation studies demonstrated heterogeneity with respect to study population, study design and cut-off values evaluated (Table 5). Two of the 22 studies only had AUC values available for the adult population. A scatter plot of all sensitivity and specificity values adjusted for sample size is shown in Fig. 2. A Forest plot could only be generated for sensitivity, as the number of true negatives could not be calculated for the majority of studies owing to incomplete follow-up of discharged patients (Fig. 3). As the CIs displayed in the Forest plot were calculated using the variance method, the values presented in Fig. 3 may differ from those published in the original studies, which may have used different calculation methods. CIs were not calculated for the studies published by Scott et al. (year) and Erdem et al. (year), as the sensitivity and sample size values were too similar.
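The text does not spell out the variance method. As a hedged illustration, the sketch below assumes it refers to the normal-approximation interval based on the binomial variance p(1 − p)/n; the function name `wald_ci` and the counts are hypothetical. Under that assumption, a study in which every case is detected has zero estimated variance, so no interval can be formed, which would be consistent with the two studies reported without CIs.

```python
import math


def wald_ci(events: int, n: int, z: float = 1.96):
    """Normal-approximation CI for a proportion (assumed 'variance method').

    Returns (estimate, (lower, upper)), or (estimate, None) when the
    estimated variance is zero, i.e. the proportion is exactly 0 or 1.
    """
    p = events / n
    standard_error = math.sqrt(p * (1 - p) / n)
    if standard_error == 0.0:
        return p, None
    lower = max(0.0, p - z * standard_error)
    upper = min(1.0, p + z * standard_error)
    return p, (lower, upper)


# Illustrative counts only: 90 of 100 confirmed cases flagged by the score.
estimate, interval = wald_ci(90, 100)
print(estimate, interval)

# Degenerate case: every confirmed case flagged, so the variance is zero.
perfect_estimate, perfect_interval = wald_ci(50, 50)
print(perfect_estimate, perfect_interval)
```

Other interval methods (for example Wilson or exact binomial intervals) remain defined at 0% and 100%, which is one reason published intervals can differ from those recalculated here.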

Table 5 Results for sensitivity, specificity, quality of external validation studies (ordered by score) and pragmatic utility of score

Fig. 2 Dot plot of sensitivity and specificity adjusted for sample size of each population. Different colours and shapes have been used to differentiate populations

Fig. 3 Forest plot for sensitivity. A variance calculation has been used for unbiased estimation of the confidence interval for each study. This may not be the same as those published in the original article as they may have used a different method. For two of the included studies a confidence interval could not be determined due to the sample size and sensitivity being equivalent

The majority of studies had a quality score between six and eight, while only six studies scored ten or more out of fifteen (Table 5). Of these, the two highest quality studies validated the acute inflammatory response (AIR) and Lintula scores [19, 20].

A general trend demonstrated that at higher cut-off values, the specificity of scoring systems improved, but at the expense of sensitivity. Clinically, this means CPRs with high cut-off values are better suited to ruling in a diagnosis of appendicitis, owing to their high specificity and good positive predictive value (Table 5; Figs. 2, 3). This was especially apparent in the Alvarado and AIR scores.
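The trade-off between sensitivity and specificity as the cut-off rises can be illustrated with a short sketch. The scores and diagnoses below are invented for demonstration and do not come from any included study; `sens_spec_at_cutoff` is a hypothetical helper that treats a score at or above the cut-off as a positive result.

```python
def sens_spec_at_cutoff(scores, has_appendicitis, cutoff):
    """Sensitivity and specificity when score >= cutoff counts as positive."""
    pairs = list(zip(scores, has_appendicitis))
    tp = sum(1 for s, d in pairs if d and s >= cutoff)
    fn = sum(1 for s, d in pairs if d and s < cutoff)
    tn = sum(1 for s, d in pairs if not d and s < cutoff)
    fp = sum(1 for s, d in pairs if not d and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Invented data: six patients with appendicitis, six without.
scores = [9, 8, 8, 7, 6, 5, 7, 5, 4, 4, 3, 2]
has_appendicitis = [True] * 6 + [False] * 6

results = {c: sens_spec_at_cutoff(scores, has_appendicitis, c) for c in (5, 7, 9)}
for cutoff, (sens, spec) in results.items():
    print(f"cut-off >= {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

In this toy data set, raising the cut-off from 5 to 9 moves the rule from detecting every case with several false positives to producing no false positives while missing most cases.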

The most commonly validated CPR was the Alvarado score, followed by Kalan’s modified Alvarado score (Figs. 2, 3) [21–37]. AUC values for the Alvarado score, which ranged between 0.74 and 0.88, were higher than that of the modified Alvarado score, which had an AUC of 0.69 from a single study (Table 5).
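For readers less familiar with the AUC, it can be read as the probability that a randomly chosen patient with appendicitis scores higher than a randomly chosen patient without it, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `auc_pairwise` and the scores are hypothetical and purely illustrative.

```python
def auc_pairwise(case_scores, control_scores):
    """AUC as the share of case/control pairs ranked correctly (ties = 0.5)."""
    total = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                total += 1.0
            elif case == control:
                total += 0.5
    return total / (len(case_scores) * len(control_scores))


# Invented scores: patients with (cases) and without (controls) appendicitis.
case_scores = [9, 8, 8, 7, 6, 5]
control_scores = [7, 5, 4, 4, 3, 2]

auc = auc_pairwise(case_scores, control_scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to a score that discriminates no better than chance, and 1.0 to perfect separation of the two groups.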

The sensitivity of the Alvarado score ranged from 67.65 to 96.3%, while specificity ranged from 58.18 to 89.39% when the originally recommended cut-off of seven was used. This variability was also seen in Kalan’s modified Alvarado score, where the sensitivity ranged from 53.8 to 97.6% and the specificity from 28.57 to 80% for the same cut-off value. This variability remained regardless of the quality of the studies (Table 5).

The AIR, Raja Isteri Pengiran Anak Saleha Appendicitis (RIPASA), Ohmann, Lintula and Eskelinen scores each had only a single validation study from which sensitivity and specificity could be obtained [19, 20, 38].

The AIR score showed a high sensitivity (92%) and moderate specificity (63%) at a cut-off value above five. At a cut-off value above eight, which was the original cut-off recommended by the authors, sensitivity fell to 20% and specificity rose to 97%. The AUC values generated for this CPR ranged from 0.805 to 0.97, with an average value of 0.872 [20, 39, 40].

The Lintula score, which was originally derived for use in paediatrics, showed high performance in adults, with a sensitivity of 87% and specificity of 96% [19]. The final score assessed in this study was based on repeated calculations for patients who were observed as inpatients. This contrasts with other studies, which reported diagnostic indices based only on scores at admission. There were no AUC values available for this CPR.

Erdem et al. validated the Alvarado, RIPASA, Eskelinen and Ohmann CPRs in a single study with a quality score of ten [38]. While the RIPASA, Eskelinen and Ohmann scores showed superior sensitivity and AUC values to the Alvarado scoring system, they showed poor specificity.

The pragmatic utility of these scoring systems (Table 5) demonstrated that the Alvarado, modified Alvarado and AIR scores are the most user-friendly CPRs. The use of decimal points and multiple weightings makes the other scores difficult to calculate in a busy clinical setting.

Discussion

There are currently 12 published CPRs available to aid diagnosis of adults presenting with suspected appendicitis. These have been validated in 22 separate studies. The aim of this systematic review was to ascertain which of these available scores performed the best. The heterogeneity of included studies precluded the possibility of performing a meta-analysis. Based on a narrative review, however, it appears the AIR score performs the best.

Assessing the best performing CPR without meta-analysis meant narratively assessing sensitivity, specificity, AUC values, usability and the quality of available studies. Although the Lintula score performed highly in terms of sensitivity and specificity, it is difficult to use in a busy clinical setting, and the comparability of the results obtained remains in question because the final score was based on repeated calculations rather than calculation at a single point in time. While the Eskelinen, RIPASA and Ohmann scores had good sensitivity and AUC values, they are difficult to calculate given the number of variables and range of weightings used. Thus, the overall best performer in terms of the quality of studies, results and usability was the AIR score. It is easy to calculate manually, and all parameters are easy to interpret except perhaps for the recommended subjective grading of rebound tenderness (as this requires clinical experience which may be limited in junior doctors). A score of ≥five appears to be better than the originally recommended cut-off of nine, as there is a lower number of missed diagnoses without a significant reduction in specificity.

The majority of published validation studies evaluated the Alvarado score and Kalan’s modified Alvarado score. This is probably because Alvarado was among the pioneers to generate a CPR as a diagnostic aid for appendicitis [8]. Although the Alvarado score is simple to calculate, the interpretation of left shift in neutrophils is time-consuming. The results from the available studies demonstrated wide variation for both sensitivity and specificity. This variation was further emphasised as the cut-off value increased and was also attributable to study design (e.g. prospective versus retrospective), variations in the characteristics of the evaluated patients, interpretation of the variables of the CPR by different clinicians in different settings, and the clinical expertise of the clinicians. While the overall sensitivity did not appear to show much variation between the Alvarado and modified Alvarado scores, the specificity appeared to be lower for the modified Alvarado score [8, 41]. Thus, although the modified Alvarado score provides a more user-friendly CPR, the removal of the left shift in neutrophils appeared to increase the number of false positives, making it less accurate than the original CPR [8, 41].

Among derivation studies, there was wide discrepancy in the derivation methodology used. Multivariate logistic regression is known to be more reliable than univariate analysis. This is highlighted by those CPRs derived with the multivariate method consistently identifying variables used in clinical practice [7, 42–44]. Variables such as rectal tenderness and diarrhoea that were identified in studies employing univariate analysis are seldom used clinically in the diagnosis of appendicitis [44–47]. The reliability of multivariate logistic regression analysis is further emphasised by CPRs which used this methodology, such as the Lintula, AIR and Eskelinen scores, showing better sensitivity and AUC values than the Alvarado score, which was derived using univariate analysis.

Several studies investigating CPRs for appendicitis conclude that clinical judgement is comparable to CPR stratification, especially when performed by a senior surgeon [21, 27, 30, 34, 48]. While this could imply that CPRs do not improve diagnostic accuracy compared to a senior surgeon, it provides evidence that CPRs can improve diagnostic accuracy to the level of an experienced surgeon when used by less experienced staff [21, 30, 48, 49]. Given that junior staff usually undertake initial evaluation of patients with suspected appendicitis, the use of a CPR is valuable in this context. Patient care is likely to be more standardised and unnecessary exposure to radiation and invasive investigations, including laparoscopy, minimised.

The heterogeneity and quality of included studies precluded meta-analysis of available data. A further limitation was the pre-defined age criteria, as many studies that included children were excluded because the findings for children and adults could not be separated. The exclusion of non-English publications may also have excluded important validation studies done in other populations.

Conclusion

There are currently 12 CPRs available for use in adults with suspicion of appendicitis. Heterogeneity in methodology and quality of available studies precluded a meta-analysis. The AIR score performed best in terms of sensitivity, specificity, AUC values and usability but has been validated in only a small number of studies. The Alvarado and modified Alvarado were the most commonly validated CPRs, but their performance was variable. The original Alvarado score outperformed the modified Alvarado score across all three criteria (sensitivity, specificity and AUC values).