Introduction

Anxiety, depression and distress are commonly measured as outcomes in research evaluating psychosocial interventions for people with cancer. While blinded diagnostic interviews offer standardised data on the proportion of patients with clinical disorders, this approach is resource intensive and limits the power of between-group analyses. Patient-reported outcome measures (PROMs) provide a popular alternative that can yield continuous data on lower as well as higher ranges of disorders with low resource requirements. Unfortunately, however, there is no accepted gold standard PROM for anxiety, depression or distress, and prevalence varies according to the PROM used. Choosing the most appropriate PROM requires appraisal of content, psychometric properties, track record, interpretability and practical issues (e.g. patient burden, language availability and cost) with reference to a specific research context. The only previous, comprehensive review in oncology evaluated questionnaires for use in screening women with breast cancer rather than outcome measurement in cancer more generally [1]. While evidence from screening is important in establishing that scores are clinically meaningful and assists with their interpretation, a PROM’s suitability as an outcome measure is also dependent on it being responsive to change.

The current authors set out to identify optimal PROMs of anxiety, depression and distress for evaluating psychosocial interventions for English-speaking adult patients receiving active treatment for heterogeneous cancers. Recommendations are made within the context of PROMs’ advantages and disadvantages for different applications.

Methods

A review was designed to evaluate candidate PROMs against the following criteria:

  1. Criterion A: Suitability for people undergoing active treatment for cancer of any type and stage;

  2. Criterion B: Reliability and validity in English-speaking cancer patients;

  3. Criterion C: Track record in identifying treatment effects in randomised controlled trials (RCTs) of psychosocial interventions;

  4. Criterion D: Clinical meaningfulness of scores;

  5. Criterion E: Availability of comparison data from cancer and general populations;

  6. Criterion F: Efficiency (number of items and number of constructs assessed);

  7. Criterion G: Ease of administration and cognitive burden.

The review proceeded in six steps:

  1. Identification of all anxiety, depression and distress PROMs used in RCTs of psychosocial interventions for English-speaking cancer patients published since 1999;

  2. Rapid filter against criteria A and B to select promising candidate questionnaires;

  3. Description of candidate PROMs;

  4. Detailed review of evidence for reliability and validity;

  5. Review of capacity to detect effects in RCTs of psychosocial interventions;

  6. Synthesis of information collected in Steps 1 through 5 aimed at developing recommendations against the criteria above.

Step 1: identification of anxiety, depression and general distress PROMs

Systematic review strategy

The following databases were searched in May 2009: Medline, PsycINFO, Embase, AMED, CENTRAL and Cinahl. Further RCTs were sought via review reference lists [2–21]. The systematic review was limited to results from English-speaking samples to minimise problems in generalising results across languages and cultures. Assuming that methodology has developed and more measures have become available over time, we focused on the previous 10 years to reduce the prominence of PROMs that have become obsolete.

Psychosocial interventions were defined and searched for using a list collated by Jacobsen and Jim ([14], p. 217). This list was supplemented with search terms used in other systematic reviews [14–22].

Where reports did not describe the study language(s) or location(s), author affiliations to institutions in Australia, Canada, Ireland, New Zealand, UK and USA were assumed to indicate English-speaking participants. International studies that included a sample from one or more of these countries were included.

In the psycho-oncology literature, the terms anxiety and depression are generally used to refer to clinical diagnoses. Distress is less well defined and is often used as an umbrella term for any multi-factorial, unpleasant emotional experience [23]. In keeping with this general concept of distress, PROMs described as assessing mood, emotion or stress were included alongside those explicitly described as distress measures. We excluded PROMs assessing clearly defined psychological constructs that are distinct from, contribute to or form a component of general distress, such as coping, adjustment, self-esteem, post-traumatic stress disorder (PTSD) or the emotional domain of quality of life. Because anxiety and depression are included within the continuum of general distress, scales that combined these constructs with other emotional experiences were included pending analysis of content at Step 3 and characterisation of psychometric properties at Step 4.

Step 2: rapid filtering

Samples of candidate PROMs identified in Step 1 and current, alternative versions (e.g. short-forms) were obtained together with supporting information from manuals and websites.

Candidates were filtered against criterion A via an item-by-item review of their contents. PROMs were excluded if (a) they were deemed suitable for use only in specific populations or (b) they comprised one-third or more items likely to be confounded by symptoms and side effects (e.g. somatic symptoms, cognitive functioning, restlessness, sexual interest) or concerns ubiquitous in certain cancer groups (e.g. preoccupation with health or mortality). The rationale for excluding PROMs with somatic content a priori was that evidence for sound performance in one clinical context could not be generalised to all cancer types, stages and treatments. The one-third threshold was agreed by the authors before the review as the proportion at or above which problematic items could seriously compromise a scale in some settings.
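The one-third rule in part (b) can be sketched as a simple check. The function name and the item counts in the usage note are hypothetical, and deciding which items count as potentially confounded remains a matter of reviewer judgement that the sketch does not attempt to capture.

```python
from fractions import Fraction


def exceeds_confound_threshold(n_items, n_confounded, threshold=Fraction(1, 3)):
    """Return True if the share of potentially confounded items is at or above the threshold.

    Fractions are used so that exactly one-third (e.g. 7 of 21 items)
    is not misclassified because of floating-point rounding.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0 <= n_confounded <= n_items:
        raise ValueError("n_confounded must lie between 0 and n_items")
    return Fraction(n_confounded, n_items) >= threshold
```

For a hypothetical 21-item scale, seven flagged items would meet the exclusion threshold, whereas six flagged items on a 20-item scale would not.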

To pass rapid filtering against criterion B, PROMs were required to have at least some published data concerning reliability and validity in an English-speaking cancer sample. Types of reliability of interest were internal consistency, test–retest reliability and inter-rater reliability between patients and proxies; types of validity were content, internal, convergent/divergent, discriminant, criterion and predictive. Psychometric articles were identified via manuals and websites and through further searches of Medline and PsycINFO using the name and acronym of each candidate PROM combined with the medical subject heading (MeSH) terms ‘neoplasms’ and ‘psychometrics’ and the key words ‘neoplasm$’ and ‘psychometric$ OR valid$ OR reliab$’. Further articles were identified via the reference lists of relevant articles and reviews. A detailed review of evidence for psychometric properties was postponed until Step 4.

Step 3: description of candidate PROMs

A description of each PROM meeting criteria in Step 2 was compiled from relevant articles, websites, publications and manuals. Information included PROM content, number of items, time to administer, response options, recall period, scoring, translation availability, licensing requirements and costs. The content of PROMs was compared with DSM-IV-TR criteria [24] for generalised anxiety disorder and major depressive disorder as a method complementary to psychometric evidence reviewed at Step 4 for delineating measures of general distress from those of anxiety and depression. A thesaurus was used to map between terms with similar meanings.

Step 4: review of evidence for reliability and validity

Quality of evidence for reliability and validity, responsiveness and information to assist with interpretation of scores was evaluated using a checklist adapted from the COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN) [25] and the Australian-developed Dementia Outcomes Measurement Suite (DOMS) project checklist [26].

Evidence for each property was independently evaluated by two reviewers until inter-rater reliability (kappa > 0.60) was achieved on at least 25 rating pairs. Any disagreements were resolved through discussion.
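The agreement rule above can be illustrated with a short sketch. It assumes the unweighted Cohen's kappa for two raters, which the review does not state explicitly; the function names are hypothetical and the ratings in the usage note are invented for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected)


def agreement_reached(ratings_a, ratings_b, min_pairs=25, threshold=0.60):
    """Apply the rule described in the text: kappa > 0.60 on at least 25 rating pairs."""
    if len(ratings_a) < min_pairs:
        return False
    return cohens_kappa(ratings_a, ratings_b) > threshold
```

With four invented rating pairs, `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` gives 0.5: observed agreement is 0.75 against a chance expectation of 0.5.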

Step 5: review of capacity to detect effects of psychosocial interventions

PROM capacity to identify effects in RCTs of psychosocial interventions was evaluated via reference to studies identified in Step 1. RCTs using candidate PROMs were identified and subjected to further exclusion criteria aimed at reducing limitations due to flawed design and methodology. Studies were excluded if they had an overall sample size of less than ten or had an attrition of more than 15% from baseline [27], except where evidence was provided that drop-out was unlikely to bias results. RCTs that used more than one candidate PROM were exempt from these exclusion criteria because of the information they provided about relative performance.

Reports were reviewed to identify instances where a candidate questionnaire had demonstrated an effect size (ES) of at least 0.2 [28–30]. We used ES because it provides a standard unit for comparison across studies and, unlike statistical significance, is independent of sample size. Effect sizes were calculated as the difference between changes in group mean scores from baseline to follow-up divided by the pooled standard deviation at baseline. Conventionally, an ES of 0.2 is considered small, 0.5 moderate and 0.8 large [31]. Studies where no ES ≥ 0.2 was identified were treated as uninformative because the intervention or research design may have been at fault rather than the PROM. Inadequacy of a given PROM was inferred only where an ES of ≥0.2 was observed on another measure assessing a similar construct.
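The ES calculation described above can be sketched as follows. The review does not specify how baseline standard deviations were pooled, so the degrees-of-freedom-weighted formula used here is an assumption, as are the function names, the use of the absolute value when applying the conventional thresholds, and the figures in the usage note.

```python
import math


def pooled_sd(sd_1, n_1, sd_2, n_2):
    """Pooled standard deviation of two groups, weighted by degrees of freedom (assumed formula)."""
    return math.sqrt(((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2))


def effect_size(int_pre, int_post, ctl_pre, ctl_post, sd_int_pre, n_int, sd_ctl_pre, n_ctl):
    """Difference between group changes from baseline, divided by the pooled baseline SD.

    On a symptom scale where lower scores are better, a negative value
    favours the intervention group.
    """
    change_difference = (int_post - int_pre) - (ctl_post - ctl_pre)
    return change_difference / pooled_sd(sd_int_pre, n_int, sd_ctl_pre, n_ctl)


def classify_effect(es):
    """Label the magnitude of an ES using the conventional 0.2 / 0.5 / 0.8 benchmarks."""
    magnitude = abs(es)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.2:
        return "small"
    return "below threshold"
```

As an invented example, if the intervention group mean falls from 10 to 7 and the control group mean from 10 to 9, with a baseline SD of 4 and 50 participants in each group, the difference in change is -2 and the ES is -0.5, which would be classed as moderate.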

To determine if different interventions might be best evaluated by different PROMs, we compared performance across intervention types. Interventions were categorised by three authors using an adapted version of the classification proposed by Jacobsen and Jim [14].

Step 6: synthesis and overall ratings

Information obtained in Steps 1 to 5 was synthesised to compare candidate PROMs against criteria B through G. A summary score for each PROM was generated by two reviewers using the system in Table 1. Each property was assigned a raw, categorical score of 0, 5 or 10. Raw scores for each property were then weighted according to the relative importance perceived by the authorship team. Ratings were intended to rank the PROMs rather than place them on an interval scale. Cost per use was excluded from overall scores due to variation between countries and dependency on study resources. Ease of administration and cognitive burden were evaluated by the two reviewers based on trial administrations of each PROM and item-by-item reviews concerning wording, recall period and response options in each case.
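The scoring system can be sketched in a few lines. The property names and weights below are hypothetical placeholders rather than the contents of Table 1; only the raw score categories (0, 5 or 10) and the use of the result for ranking are taken from the description above.

```python
ALLOWED_RAW_SCORES = {0, 5, 10}


def weighted_score(raw_scores, weights):
    """Sum of raw categorical scores multiplied by their weights.

    raw_scores and weights are dicts keyed by property name; the keys
    must match so that no property is silently dropped.
    """
    if raw_scores.keys() != weights.keys():
        raise ValueError("raw_scores and weights must cover the same properties")
    for prop, raw in raw_scores.items():
        if raw not in ALLOWED_RAW_SCORES:
            raise ValueError(f"raw score for {prop!r} must be 0, 5 or 10")
    return sum(raw_scores[prop] * weights[prop] for prop in raw_scores)


def rank_proms(summary_scores):
    """Order PROM names from highest to lowest summary score (ranking only, not an interval scale)."""
    return sorted(summary_scores, key=summary_scores.get, reverse=True)


# Hypothetical weights and raw scores, for illustration only.
example_weights = {"reliability_validity": 3, "track_record": 2, "efficiency": 1}
example_raw = {"reliability_validity": 10, "track_record": 5, "efficiency": 10}
example_total = weighted_score(example_raw, example_weights)
```

With these placeholder values the total is 10 × 3 + 5 × 2 + 10 × 1 = 50. Because the ratings are intended only to rank PROMs, differences between totals should not be read as equal-interval quantities.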

Table 1 Criteria and weights for use in generating overall scores for measures of anxiety, depression or distress

Results

Step 1: identified PROMs

Altogether, 173 RCTs of psychosocial interventions were identified. Of these, 132 assessed anxiety, depression and/or distress by means of a total of 30 PROMs (Table 2).

Table 2 PROMs used to assess anxiety, depression and/or distress in English-language RCTs of psychosocial interventions published since 1999

Step 2: PROMs excluded by rapid filtering

Total scores on the SCL-90-R, BSI-53/18, MHI-38, GHQ-28 and POMS-65/Bi-Polar/30 and the unofficial short-form, the POMS-37, were discounted as distress measures because they are generated through summation of subscales assessing a range of psychological constructs, many of which have a somatic emphasis.

Table 3 gives information about other scales excluded during filtering.

Table 3 PROMs excluded after failing to meet criteria A or B during rapid screening at Step 2

Step 3: description of candidate PROMs

PROMs passing rapid screening for criteria A and B at Step 2 are detailed in Table 4.

Table 4 Characteristics of candidate PROMs that passed screening against criteria A and B at Step 2

Table 5 summarises item-by-item content for each scale and compares these with terms in the DSM-IV-TR criteria [24]. While there are overlaps between the criteria for generalised anxiety disorder and major depression, the content of all candidate anxiety and depression scales emphasises the disorder which they were designed to assess. All distress measures except the DT place heavier emphasis on depressive symptoms than those associated with anxiety.

Table 5 Content comparison between DSM-IV-TR criteria and candidate PROMs

Step 4: review of evidence for reliability and validity

Thirty-eight articles were amenable to evaluation using the checklist. Inter-rater reliability of kappa > 0.60 was achieved for internal consistency (kappa = 0.71), criterion validity (kappa = 0.82), discriminant validity (kappa = 0.63) and convergent validity (kappa = 0.67). Other properties were reported too infrequently for inter-rater reliability to be properly assessed. No articles were found reporting on inter-rater reliability, content validity or floor and ceiling effects for any PROM. Ratings of these articles are summarised in Table 6.

Table 6 Evidence for validity, reliability, responsiveness and interpretability of each candidate PROM in English-speaking cancer populations

Step 5: review of capacity to detect effects of psychosocial interventions

One hundred and five RCTs identified in Step 1 used a candidate measure meeting both criteria in Step 2. Of these, 63 studies were excluded due to small sample size or attrition >15%. Articles reporting the remaining 42 trials were reviewed for information on samples, interventions and ESs for each PROM (Table 7).

Table 7 Total 42 samples included for effect size (ES) calculation in Step 5

Based on the range of interventions in which an ES of ≥0.2 has been identified by each PROM, broad-scale utility for intervention studies is best supported for the HADS (Table 8). The exception is exercise/physical-related complementary and alternative medicine, for which only the POMS-65 has identified an intervention effect.

Table 8 Largest effects identified by each PROM in evaluating various types of psychosocial intervention

Step 6: synthesis and overall ratings

Table 9 provides summary ratings against criteria A through G using the weighted checklist (Table 1 above).

Table 9 Weighted scores for PROMs assessing anxiety, depression and/or distress

Discussion

The current review set out to evaluate measures of anxiety, depression and distress with the aim of informing PROM selection for studies evaluating psychosocial interventions for English-speaking adults with heterogeneous cancer diagnoses. The Hospital Anxiety and Depression Scale (HADS) scored highest overall (weighted score = 77.5), the unofficial short-form, Profile of Mood States-37 (POMS-37), second (weighted score = 60), and the original POMS (POMS-65) and Centre for Epidemiological Studies Depression Scale (CES-D) joint third (weighted score = 55).

HADS

The HADS scored highest overall due to the wealth of evidence for its psychometric properties and its efficiency in providing scores for anxiety, depression and distress using only 14 items. However, questions have been raised about the role of the HADS overall score (HADS-T) relative to HADS-A and HADS-D, the implications of the HADS’ avoidance of somatic content, its emphasis on anhedonia in evaluating depression, and appropriate scale cut-offs. Each of these issues will be discussed in turn.

Although the HADS manual [32] advises against generating a total score, the HADS-T has been widely used both in screening and outcome measurement as an index of overall distress. Content analysis at Step 3 suggests that only three items in the HADS-A and none in the HADS-D assess emotional experiences distinct from those defining generalised anxiety disorder and major depressive disorder. However, several studies have found the HADS-T to be superior to one or both subscales in identifying clinical levels of distress variously defined [33–37]. Rasch analyses of data both from cancer patients [36] and patients attending musculoskeletal rehabilitation [38] have also been supportive of the HADS-T, although results from factor analyses have been more mixed [33,39–42]. The HADS-T was reported too infrequently in the RCTs reviewed here to comment on its relative responsiveness. However, on balance, we decided to accord the HADS-T the status of a measure of general distress pending future psychometric evaluation.

The HADS’ omission of somatic items is intended to avoid confounding psychological symptoms with disease or treatment. This aptitude has been confirmed in breast cancer by a factor analysis that found all items to load more strongly onto a psychological than somatic factor [40]. However, while the HADS has performed well as a screening measure in samples with poorer health status and those on active treatment, it has also performed surprisingly well in those who were disease-free and the general population, and relatively poorly in those with metastatic or progressive disease [33,37,43–45].

Unexpectedly poor performance in advanced cancer may be explicable in terms of the HADS’ emphasis on anhedonia, which may occur in this group for reasons other than depression [33]. While anhedonia is an important feature of major depression, there is concern that the HADS may have limited sensitivity to minor depression or adjustment disorder with depressed mood [1]. Consistent with this concern are several studies showing the HADS to be better at screening for anxiety than for depression [46–48]. However, a recent meta-analysis across language versions found the HADS-D to be superior to both the HADS-A and HADS-T in ruling out, and similar in ruling in, cases of mixed affective disorders (depression, anxiety, adjustment disorders combined; fraction-correct scores = HADS-D 78.3%; HADS-A 65.9%; HADS-T 72.6%) [49]. In the current review, the HADS-D was also found to perform comparably to the other scores in identifying intervention effects in RCTs of psychosocial interventions. As a result, we recommend continued use of the HADS-D in combination with the HADS-A and HADS-T where mixed affective disorders are the outcome of interest.

Optimal cut-offs for the HADS have varied between studies, with those recommended by its developers performing poorly in some cases [46]. Data from different studies are difficult to compare given variation in the prevalence of anxiety and depression between samples. Most informative are comparisons between more than one candidate measure in the same sample. Results from four studies of this type are available, all of which have included the HADS (see Table 10). The HADS has performed better than or comparably with other measures in nearly all cases.

Table 10 PROM screening performance in studies comparing two psychological measures with a diagnostic interview

Screening performance is relevant to the HADS’ use as an outcome measure because it enables clinical meaning to be attached to differences or changes in scores. Outcomes on the HADS have been reported descriptively in some cases (e.g. means, standard deviations) and with reference to clinical cut-offs (e.g. percentage above and below) in others. To facilitate comparison between studies, future publications need to report results in both forms and use the same cut-offs. Cut-offs for the HADS-A and HADS-D recommended in the original article have been used most widely, namely: normal = 0–7, borderline = 8–10, clinical = 11–21 [50]. Further research is needed to confirm the optimal cut-off for the HADS-T.
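Reporting in both forms could look like the following sketch, which applies the subscale cut-offs quoted above (normal = 0–7, borderline = 8–10, clinical = 11–21) and returns descriptive statistics alongside the percentage of patients in each band. The function names and example scores are hypothetical, and the sketch applies to HADS-A or HADS-D subscale scores only, since no cut-off for the HADS-T is established here.

```python
from statistics import mean, stdev

BANDS = ("normal", "borderline", "clinical")


def categorise_hads_subscale(score):
    """Map a HADS-A or HADS-D subscale score (0-21) to its cut-off band."""
    if not 0 <= score <= 21:
        raise ValueError("HADS subscale scores range from 0 to 21")
    if score <= 7:
        return "normal"
    if score <= 10:
        return "borderline"
    return "clinical"


def summarise_hads_subscale(scores):
    """Report a sample both descriptively and by cut-off band.

    Requires at least two scores, because the sample standard
    deviation is undefined for a single observation.
    """
    bands = [categorise_hads_subscale(s) for s in scores]
    n = len(scores)
    return {
        "mean": mean(scores),
        "sd": stdev(scores),
        "percent_by_band": {band: 100 * bands.count(band) / n for band in BANDS},
    }
```

For four invented scores of 3, 8, 12 and 15, the summary gives a mean of 9.5 with 25% of patients in the normal band, 25% borderline and 50% clinical.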

POMS

The POMS-37 unofficial short-form scored highest after the HADS due to consistent evidence for its validity and responsiveness (weighted score = 60). The POMS-37 was developed specifically for use with cancer patients who may be too unwell to complete all original 65 items (POMS-65, weighted score = 55) [51]. It has correlated strongly with and matched the performance of the POMS-65 in three samples [51–53]. Unlike both the POMS-65 and HADS, the POMS-37 is free to use. All POMS versions differ from other PROMs reviewed here in that they have not been designed or used to screen for psychological disorders but rather to assess mood. However, the content and performance of the POMS-65 and POMS-37 suggest these may have clinical utility yet to be explored. Items from the tension–anxiety and depression–dejection scales of both versions closely resemble DSM-IV-TR criteria, albeit with heavy emphasis on the cardinal features rather than across the gamut of symptoms. The depression–dejection scale of the POMS-65 has also correlated highly with the CES-D (0.63 [54]; 0.80 [55]), which, together with the SCL-90-R, offers the most comprehensive assessment of DSM criteria of PROMs reviewed here.

CES-D

The CES-D (weighted score = 55) has performed consistently well in screening and as an RCT outcome measure. However, its criterion validity has been evaluated in only two small-scale studies focusing exclusively on major depression [34,56]; in one of these, its specificity was inferior to that of the HADS-T (see Table 10 above). Future studies are needed to evaluate the validity of items assessing problems with sleep, appetite and concentration in samples on active treatment. A further limitation of the CES-D concerns its somewhat “idiosyncratic” [1] assessment of symptom frequency rather than severity, leading to a mid-range rating regarding ease of administration and cognitive burden. The final feature that counted against the CES-D was the fact that it requires 20 items to assess only one construct. Three short-forms of the CES-D have been developed: 10-, 11- and 15-item versions. Neither the 10- nor the 11-item version has been validated in cancer patients, though the 11-item version performed satisfactorily in a sample of disease-free breast cancer survivors [57]. The 15-item version was developed using factor analysis after the two interpersonal items from the long-form were found uninformative and three further items were found to have an unacceptable gender bias in patients with heterogeneous cancer diagnoses and caregivers [58]. This version may show promise in the future.

Limitations

The current review focused on evidence for performance of PROMs in measuring anxiety, depression and distress in English-speaking adult patients receiving active treatment for heterogeneous cancers. The results should not be generalised to other linguistic or clinical contexts such as paediatric or palliative care.

Our review’s most important limitation concerns the potential for selection bias in identifying and reviewing evidence for each PROM. Limits on evidence for psychometric properties and PROM somatic content were imposed to maximise confidence when recommending use across cancer diagnoses and treatments. However, availability of published evidence is subject to a range of factors beyond PROM performance. PROMs may be used because they offer a ‘safe’ choice or potential for comparing with previous research rather than because they are suited to a particular study. This phenomenon introduced unavoidable bias to the review process. On the other hand, publication bias should not have posed a problem inasmuch as poorly performing PROMs can be expected to have presented a lower profile, consistent with the aims of the review. PROMs widely used in health settings but excluded because of a surprising lack of evidence for psychometric properties in English-speaking cancer populations were the State-Trait Anxiety Inventory (STAI) [59], Depression Anxiety Stress Scales (DASS) [60] and Beck Depression Inventory—Primary Care (BDI-PC) [61]. Item banks for administration via computer-adaptive testing are also becoming available for assessing anxiety and depression in cancer clinical research. Further research is needed to establish whether any of these alternatives offer advantages over the HADS, CES-D and POMS. Development of new, purpose-built PROMs aimed at addressing shortfalls in existing measures would be resource intensive and limit comparison between studies in and outside oncology.

Due to lack of precedent, decisions regarding inclusion and exclusion criteria were in some cases necessarily based on the authors’ expert opinion rather than on published evidence. Perhaps most important were our operational definitions of anxiety, depression and general distress, which excluded PROMs widely used in cancer research to assess coping, adjustment and PTSD. Readers are encouraged to critically appraise the relevance of these decisions within the context of the objectives and samples of each intended application. The weightings allocated to each property in our rating system could also be adapted to meet alternative requirements, although the fact that PROMs were ranked similarly by raw and weighted scores suggests the results are fairly robust.

A final limitation arose from the fact that only one RCT identified in Step 1 enabled head-to-head comparisons between candidate measures of the same construct [62]. More results of this type would have been valuable in distinguishing PROM performance from the influence of design and methodology, and future studies that compare PROMs are encouraged.

Recommendations

The HADS’ efficiency and substantial track record recommend its use where anxiety, mixed affective disorders or general distress are outcomes of interest. Where cost is a concern, the POMS-37 is recommended to measure anxiety or mixed affective disorders but does not offer a suitable index of general distress and, like the HADS, emphasises anhedonia in measuring depression. Where depression is the sole focus, the CES-D is recommended.

Further research is needed to inform interpretation of scores from the HADS-T and compare its performance with the HADS-D in assessing depressive disorders other than major depression. Evidence of this type has potential to inform not only outcome measurement but also screening in cancer clinics, which is becoming increasingly popular [63].

More generally, future studies are needed that directly compare PROMs of the same construct to assist researchers in choosing the most appropriate outcome measure for a given research context.