Severe fatigue can be a disabling experience for people with cancer. Fatigue is recognised as one of the most prevalent and debilitating symptoms of cancer and affects up to one third of cancer survivors [1, 2]. The US National Comprehensive Cancer Network (NCCN) defines cancer-related fatigue (CRF) as ‘a distressing, persistent, subjective sense of physical, emotional, and/or cognitive tiredness or exhaustion related to cancer and/or cancer treatment that is not proportional to recent activity and interferes with usual functioning’ [3].

Implementation of robust evidence-based guidelines is arguably needed for the effective management of CRF [4, 5]. Several guidelines exist, although they appear to be under-utilised by oncology health professionals [4, 6].

The research question for the current study was ‘Which clinical practice guideline for cancer-related fatigue is the most suitable for application?’ A guideline is defined as ‘a rule or instruction that shows or tells how something should be done’ [7]. Clinical practice guidelines are ‘statements that include recommendations intended to optimise patient care that are informed by a systematic review of evidence and an assessment of the benefits and harms of alternative care options’ [8] p.4. The purpose of clinical guidelines is to assist practitioners and patients in choosing the most appropriate therapeutic interventions [8, 9]. To determine the validity of clinical guidelines, a number of factors can be considered, including development procedures; stakeholder involvement; peer review; the level, quality, and completeness of evidence; and the clarity of published recommendations [8, 10, 11].

Methods

Guideline search and short listing

We conducted a systematic search for published guidelines for the screening, assessment, and treatment of CRF using terms described by Howell et al. [12] (Fig. 1). The databases MEDLINE®, PsycINFO, EMBASE®, and CINAHL® were searched in July 2015. The search for CRF guidelines extended to the websites of four guideline portals and the Google search engine. Websites of identified guideline developers were checked for recent updates. Reference lists of guidelines and reviews describing guidelines were examined.

Fig. 1 Summary of methods used in the appraisal

The inclusion and exclusion criteria were applied by one researcher (EP). The target population included adults in the post-treatment phase with a diagnosis of any cancer type or disease stage. Guidelines for the assessment or treatment of CRF that were written in English were considered if they detailed their development methodology. To ensure recommendations were evidence-based, care plans or algorithms without explicit links to the evidence were excluded. Older versions of guidelines, including those developed more than 5 years ago without an update, were excluded due to the rapidly changing evidence base. Five CRF guidelines met the inclusion criteria. The guideline developers were the American Society of Clinical Oncology (ASCO) [13], the National Comprehensive Cancer Network (NCCN) [3, 14], the Oncology Nursing Society (ONS) [15], and the Canadian Association of Psychosocial Oncology (CAPO) [12, 16].

Previous quality appraisals of included guidelines using the AGREE-II instrument were used to short list the most rigorously developed guidelines for further appraisal. The AGREE-II instrument [17] is a valid guideline quality appraisal tool used by guideline developers [18] and statutory bodies [11]. The AGREE-II has 23 items within six domains of scope and purpose, stakeholder involvement, rigour of development, clarity of presentation, applicability, and editorial independence. Each item is rated on a Likert scale from 1 (strongly disagree) to 7 (strongly agree). Domain scores are reported as a percentage of the maximum possible score [18]. Initially, quality scores for ‘rigour of development’ (domain 3) were compared. Guideline scope and other domain scores assisted the overall decision. Recent guidelines without published scores were also considered for appraisal.

Appraisal methodology

Two instruments were used to appraise the guidelines: the AGREE-II instrument [17] and a checklist of Australian National Health and Medical Research Council (NHMRC) guideline standards [11]. AGREE-II [17] has been endorsed as the most comprehensive appraisal tool for local, national, and international clinical practice guidelines [19, 20]. Evaluation using AGREE-II is limited to the appraisal of the quality of guideline development processes and documentation [19]. The second tool was used to extend the evaluation.

For endorsement and approval for practice, clinical practice guidelines need to meet scientific standards. The NHMRC standard aligns with, and expands upon, most of the 20 US Institute of Medicine clinical practice guideline standards [8]. Therefore, this appraisal was considered to be internationally relevant. For NHMRC approval, 54 mandatory criteria must be met, and an additional 33 desirable criteria are listed [11]. The NHMRC domains are governance and stakeholder involvement, scope and purpose, evidence review, guideline recommendations, structure and style, public consultation and dissemination, and implementation.

Seven standards were considered not applicable to international guidelines due to specific references to NHMRC processes or Indigenous people. The selected CRF guidelines were evaluated against a checklist of 47 guideline requirements [11]. Categorical responses were ‘met’, ‘not met’, or ‘not applicable’. Reviewers recorded qualifying statements and the location of the evidence in the document. Guidelines, technical reports, administrative reports, and developer websites were the key sources of information used in the appraisals.

The AGREE Research Trust [21] recommended that at least two, and ideally four, reviewers should independently appraise each guideline (www.agreetrust.org). Four reviewers, including one consumer, were purposively recruited by direct invitation. Inclusion criteria were determined to ensure an informed multidisciplinary review panel. The inclusion criteria were a relevant qualification in medicine, nursing, or occupational therapy AND expertise in clinical practice or research in the field of cancer supportive care; OR a consumer of health care with sufficient knowledge of guideline evaluation to complete the appraisal. Details of the reviewers’ professional discipline, qualifications, age, gender, and location were recorded. Reviewers were offered payment for 8 h at a senior postdoctoral rate.

The four reviewers included an oncology nurse coordinator, a medical oncologist, an occupational therapist, and a consumer representative. All reviewers were female, with tertiary qualifications at bachelor level and an average age of 42.5 years (SD 12).

After written consent was obtained, reviewers were sent relevant guideline documentation, website links, and electronic versions of the NHMRC checklist and AGREE-II rating forms. A link to online training for AGREE-II and the user manual were provided. Reviewers were instructed to read the guideline documentation in detail and then rate their level of agreement with statements in the AGREE-II instrument and whether each NHMRC standard was met. The completed forms were returned to the research team.

The La Trobe University Human Ethics Sub-Committee of the College of Science, Health and Engineering approved all procedures in this study, reference number FHEC14/270.

The results from the four reviewers for each appraisal tool were tabulated into spreadsheets and analysed using Statistical Package for Social Sciences (SPSS®) software version 22 (IBM®). Non-parametric tests were used to determine the statistical significance of differences in domain scores between the two guidelines because of the small sample size [22].

The null hypotheses tested in the analyses were that the median of differences between guidelines in AGREE-II domain scores, and in the proportion of NHMRC standards met in each domain, equals zero (tested at p < 0.05).

Data handling

Raw data for each instrument were modified to enable comparisons. The raw data obtained for each standard in the NHMRC appraisal were one categorical rating per reviewer. Two researchers independently adjusted any ambiguous ratings to either ‘met’ or ‘not met’, using the reviewers’ notes: ‘unsure’, ‘N/A’, or ‘partly met’ were adjusted to ‘not met’, and ‘mostly met’ was adjusted to ‘met’. Adjustment discrepancies were resolved by mutual agreement between the researchers.

The research team defined compliance for each NHMRC standard a priori as being positively endorsed by at least three of the four reviewers. Using this definition, the data were further adapted to an overall rating of standard ‘met’ or ‘not met’. If one or two reviewers rated a standard as ‘met’, reviewer notes were used to determine whether the overall rating should be changed. The proportion of standards met in each domain was used as the unit for comparative analysis.
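The adjustment and compliance rules described above amount to a small decision procedure. A minimal sketch follows (function and variable names are illustrative, not taken from the study):

```python
def adjust(rating):
    """Collapse ambiguous reviewer ratings per the a priori rules:
    'mostly met' counts as met; 'unsure', 'N/A', and 'partly met' do not."""
    return "met" if rating.lower() in ("met", "mostly met") else "not met"

def standard_met(reviewer_ratings, threshold=3):
    """A standard is met overall when at least 3 of the 4 reviewers endorse it."""
    return sum(adjust(r) == "met" for r in reviewer_ratings) >= threshold

def proportion_met(domain_standards):
    """Proportion of standards met within a domain (the unit of analysis).
    domain_standards: one list of the four reviewer ratings per standard."""
    return sum(map(standard_met, domain_standards)) / len(domain_standards)
```

Borderline cases (one or two ‘met’ ratings) were additionally resolved from reviewer notes, a manual step this sketch does not capture.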

The AGREE-II unit of analysis is the domain score and is expressed as a percentage of the maximum possible score [17]. Domain scores were calculated using the formula specified by The AGREE Research Trust [21]:

$$ \frac{\left(\mathrm{Obtained\ score} - \mathrm{Minimum\ possible\ score}\right)\times 100}{\mathrm{Maximum\ possible\ score} - \mathrm{Minimum\ possible\ score}} $$
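The published formula can be sketched in code as follows (a minimal illustration; the item scores in the example are hypothetical):

```python
def agree_domain_score(item_scores, n_items, n_reviewers):
    """AGREE-II domain score as a percentage of the maximum possible score.

    item_scores: every 1-7 rating given in the domain across all
    reviewers (len == n_items * n_reviewers).
    """
    obtained = sum(item_scores)
    minimum = 1 * n_items * n_reviewers   # every item rated 1 by everyone
    maximum = 7 * n_items * n_reviewers   # every item rated 7 by everyone
    return (obtained - minimum) * 100 / (maximum - minimum)

# Hypothetical 2-item domain, 4 reviewers, every rating a 6
score = agree_domain_score([6] * 8, n_items=2, n_reviewers=4)
```

Note that the rescaling means a domain where every reviewer gives the minimum rating of 1 scores 0 %, not 14 %.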

Individual reviewer scores, median, and overall scores were determined for each AGREE-II domain. Changes in reviewer ratings for each domain were plotted using Minitab® statistical software.

Analysis

Inter-rater reliability was calculated in SPSS® using the kappa statistic for the individual adjusted NHMRC data sets and the intra-class correlation coefficient (ICC) for AGREE-II scores. The kappa statistic was used to determine inter-rater agreement for independent evaluation of all cases by the same reviewers, using categorical variables with the same number of categories [23]. For the NHMRC ratings, kappa was calculated for each pair of raters (six pairs) and the results were averaged as described by Light [24]. Proportions of agreement for the adjusted NHMRC ratings were calculated for each of the six pairs of reviewers using the tool at http://vassarstats.net. A two-way mixed, consistency, average-measures ICC was calculated to assess the degree of reviewer consistency in the rating of AGREE-II domains. This approach reflected the non-random sample of reviewers rating the guidelines, the ‘average rating’ as the unit of analysis, and consistency of response as appropriate for Likert scales [23]. The AGREE Rating Concordance Calculator (available from the Guidelines Resource Centre at www.cancerview.ca) was used to determine whether additional reviewers were required. Decision rules were based on the standard deviations of the raw scores for both guidelines.
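The two reliability statistics can each be computed in a few lines. The sketch below is an illustrative pure-Python re-implementation, not the SPSS routines used in the study, and the data in the tests are hypothetical:

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two raters' categorical ratings (lists) of the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    p_e = sum((a.count(c) / n) * (b.count(c) / n)          # chance agreement
              for c in set(a) | set(b))
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

def lights_kappa(ratings_by_rater):
    """Light's kappa: the mean of Cohen's kappa over all rater pairs."""
    pairs = list(combinations(ratings_by_rater, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

def icc_consistency_avg(matrix):
    """Two-way mixed, consistency, average-measures ICC, i.e. ICC(C,k),
    from the standard two-way ANOVA mean squares.
    matrix: one row per rated target (e.g. domain), one column per rater."""
    n, k = len(matrix), len(matrix[0])
    grand = sum(map(sum, matrix)) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows
```

The consistency form of the ICC removes the rater (column) effect from the error term, so a reviewer who rates systematically higher or lower than the others does not reduce reliability; this matches the study's focus on consistency rather than absolute agreement for Likert-scale ratings.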

McNemar’s exact test was performed in SPSS® to evaluate the statistical significance of differences in the proportion of NHMRC standards met in each domain. McNemar’s test evaluates the significance of differences in pairs of dichotomous variables using a 2 × 2 contingency table [25]. A related-samples Wilcoxon signed rank test [22] was performed in SPSS® to determine the significance of the difference in AGREE-II domain scores between the two guidelines. The significance level was set at p < 0.05 for all analyses.
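McNemar’s exact test reduces to a two-sided binomial test on the discordant pairs, i.e. the standards met by one guideline but not the other. A minimal sketch (the counts in the example are hypothetical):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar p-value for a paired 2x2 table, computed as a
    two-sided binomial test (p = 0.5) on the n = b + c discordant pairs.
    b, c: counts of the two kinds of discordant pairs."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical: 8 standards met by one guideline only, none by the other
p_value = mcnemar_exact(0, 8)
```

For the AGREE-II domain-score comparison, the analogous paired test in a Python workflow would be `scipy.stats.wilcoxon`, which implements the related-samples Wilcoxon signed rank test.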

Results

Of the included guidelines, the CAPO, ONS, and NCCN fatigue guidelines apply to all stages of cancer. The ASCO and NCCN survivorship guidelines are specific to disease-free survivors of adult-onset cancer. Guideline development methodology, target populations, and evidence categories varied between the guidelines, as summarised in Table 1. Four guidelines included screening, assessment, and treatment of fatigue; the Oncology Nursing Society (ONS) guideline [15] focused on the assessment and treatment of CRF only. Guideline recommendations were relatively consistent, but some differences were apparent, particularly in the level of supporting evidence. The guideline recommendations are summarised in Appendix 1.

Table 1 Characteristics of included guidelines

Five publications were identified that reported AGREE-II results for one or more fatigue management guidelines [12, 13, 16, 28, 29]. The domain scores for each pair of reviewers are shown in Table 2. Domain 3 represents ‘rigour of development’.

The results of the appraisal by Bower et al. [13] suggested that the Canadian Association of Psychosocial Oncology (CAPO) fatigue guideline [12] was developed with substantial rigour compared with the 2013 NCCN fatigue and NCCN survivorship guidelines. Two reviews compared the NCCN fatigue and ONS guidelines; both rated the NCCN guideline’s rigour very low, at approximately half the ONS rigour scores [12, 29]. Domain scores for the 2014 NCCN fatigue guideline in the review by Howell et al. [16] were markedly higher than in other reviews. This was considered inconsistent with previously reported scores, perhaps due to individual rater marking styles. The two NCCN guidelines and the ONS guideline were then eliminated from further consideration in this study due to lower methodological rigour and, in the case of the ONS guideline, the lack of screening recommendations.

Based upon two independent reviews [13, 28], the CAPO fatigue guideline was selected and the 2015 version was appraised [16]. The ASCO fatigue guideline for survivors [13] was also selected due to promising scores in several domains in its only review [16].

Appraisal 1: NHMRC guideline standards

Inter-rater agreement across both guidelines using Light’s kappa [30] was 0.48 with a standard error of 0.09, indicating moderate agreement between reviewers (0.41 ≤ κ ≤ 0.60) [24] (see Table 3). This was consistent with the kappa values calculated for each guideline. The mean observed proportion of agreement was 0.76, with 95 % CI 0.56 to 0.93 (data on request).

Table 2 Published AGREE-II domain scores for fatigue guidelines

The number and proportion of the 47 NHMRC guideline standards met in each domain are shown in Table 4. The CAPO guideline met 37 standards and the ASCO guideline met 20. The proportion of standards met by each guideline differed in four of the seven domains. The difference was statistically significant, with a moderate effect size, for domain D ‘Recommendations’ (p = 0.008); see Table 4.

Appraisal 2: AGREE-II instrument

Table 3 Inter-rater agreement for NHMRC standards for both guidelines using Cohen’s κ

The inter-rater reliability ICC for all domain scores was 0.86 (95 % CI 0.66 to 0.95) for absolute agreement and 0.89 (95 % CI 0.73 to 0.96) for consistency. Separate ICCs calculated for each guideline did not differ substantially from the combined ICC. These figures were in the excellent range as defined by Cicchetti [31]. According to AGREE Trust decision rules, additional reviewers were not required [21].

Table 4 Results of NHMRC evaluation

Mean and median AGREE-II domain scores, with their ranges, are reported in Table 5. A related-samples Wilcoxon signed rank test was performed in SPSS to test the null hypothesis. The sum of median differences was 19 (p = 0.046) in favour of the CAPO guideline, leading to rejection of the null hypothesis. Further analysis compared median scores for separate domains, using the results at the reviewer level (data on request). The largest median difference for an individual domain was for Editorial Independence; this comparison was statistically significant, with higher scores for CAPO (p < 0.05). This difference represented a small effect size according to Cohen’s criteria [32].

Table 5 Results of AGREE-II evaluations

Discussion

The trustworthiness of the study was enhanced by using four independent reviewers to rate two guidelines with two instruments. Rigour was further increased through the use of an internationally recognised, valid guideline appraisal instrument, the AGREE-II [17, 19], together with national guideline standards [11]. Few statistically significant differences in domain scores were found in this analysis. For domains with very few items, small p values, and the consequent rejection of the null hypothesis, were not expected because of the influence of measurement error [33]. The AGREE-II domain of Editorial Independence has only two items, so the clinical relevance of this finding was uncertain. In contrast, the significant finding of a moderate effect size for the 12-item NHMRC Recommendations domain indicated important differences between the guidelines in the methodological quality of the recommendations. The significant overall AGREE-II result suggested that the quality of development and reporting of the CAPO fatigue guideline was superior to that of the ASCO guideline. Additionally, the CAPO fatigue guideline met 17 more NHMRC guideline standards than the ASCO guideline did. This suggests that the CAPO guideline is the more suitable for clinical use.

This study had several limitations. Although short-listing of guidelines using AGREE-II has been recommended [18], the use of previously published scores to guide selection is novel. Benchmark AGREE-II quality scores have not been published, so relative scores were used to rank quality domains. Readers should not consider these scores absolute, but rather view them in the context of other ratings by the same pairs of reviewers. Comparing different iterations of guidelines could be misleading if the development methodology and evidence change. The use of NHMRC standards as a tool by which to compare guideline properties is also novel. Because the domain constructs of the NHMRC standards remain untested, their validity may be questioned [34].

The AGREE-II and NHMRC guideline standard evaluations were dependent on obtaining accurate and complete guideline documentation. It is possible that materials were overlooked or incomplete, which could result in lower ratings [35]. Dichotomous scoring in the NHMRC appraisal may have also reduced scores if a standard were partially met [35]. To address this, we adjusted the ambiguous ratings based on the notes of all four reviewers. The overall results were unchanged by this procedure.

Both tools appraised documentation and methodological quality [19]. Both guidelines scored poorly for applicability and implementation. Neither the validity nor the clinical appropriateness of guideline recommendations was evaluated in this appraisal. Additional evaluation methods are required to determine the acceptability, feasibility, and effectiveness of the guidelines. Further evaluation could include pre-implementation studies using end-user feedback [36], tools such as the GuideLine Implementability Appraisal [37], and pilot clinical studies [38].

Conclusions

The 2015 CAPO guideline for cancer-related fatigue appears to be appropriate for clinical use worldwide. Further enhancement of the guidelines is needed to enable application to local contexts. It is recommended that guideline developers make the application of evidence-based guidelines easier to enhance their implementation.