Introduction

Determining the minimal important difference (MID) [1–4] for interpreting health-related quality of life (HRQOL) scores from cancer clinical trials is useful to clinicians, patients, and researchers, both as a benchmark for assessing the effectiveness of a health care intervention and for determining the sample size in a clinical trial. Benchmarks for interpreting differences between groups cross-sectionally may differ from those for interpreting changes over time within groups [5].

Methods for identifying MIDs are classified as either anchor-based or distribution-based [6]. Anchor-based methods link HRQOL measures either to known indicators that have clinical relevance (e.g., progression of disease or performance status (PS)) or to patient-derived ratings of change in health [1, 6]. Distribution-based approaches rely on summary statistics calculated from the HRQOL data; two commonly used statistics are the effect size [7] and the standard error of measurement (SEM) [8]. The effect size used in the MID literature is the mean change divided by the between-person standard deviation; this aids interpretation by benchmarking the mean change against the degree of variation among individuals. An effect size of 0.2 standard deviations (SD) of HRQOL scores has been proposed as a definition of a minimal clinically important difference [9]. Some have suggested that 0.5 SD is a reasonable approximation of the MID [10], although others consider that this estimate is not generalizable [11, 12]. Thresholds of 1 SEM have also been used to estimate MIDs [13]. Other investigators, using data from patients’ ratings of their own global change [14] or from patients’ comparisons of themselves to others [15], determined that 5–10% of the instrument range represents a subjectively significant difference or clinically significant change.
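As a worked illustration of the effect size described above, the definition can be written out as follows. The symbols (x for an individual score, d-bar for the mean change, n for the number of patients) and the numbers in the example are assumptions introduced here for illustration only; they are not taken from the trial data.

```latex
% Effect size as used in the MID literature: mean change over between-person SD
\[
  \mathrm{ES} = \frac{\bar{d}}{\mathrm{SD}_{\text{between}}},
  \qquad
  \bar{d} = \frac{1}{n}\sum_{i=1}^{n}\bigl(x_{i,2} - x_{i,1}\bigr)
\]
% Hypothetical example on a 0-100 scale with a between-person SD of 20 points:
\[
  \bar{d} = 4,\ \mathrm{SD}_{\text{between}} = 20
  \;\Rightarrow\;
  \mathrm{ES} = \frac{4}{20} = 0.2,
  \qquad
  0.5 \times \mathrm{SD}_{\text{between}} = 10 \text{ points}
\]
```

Under these illustrative numbers, a 0.2 SD threshold corresponds to 4 points and a 0.5 SD threshold to 10 points, which shows how strongly a distribution-based MID depends on the variability of the sample at hand.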

It is important to determine the MID for various instruments (questionnaires) in a variety of cancers because p values alone do not provide information about the clinical meaningfulness of differences between groups or changes over time within a group. P values are highly dependent on sample size. In large samples, significant p values can be obtained when numerical differences in HRQOL change scores are small and not likely to be clinically meaningful. As more studies examine the MIDs for differing questionnaires and cancers, it will become evident whether it is possible to generalize and adopt one MID or a set of MIDs for all questionnaires and patient groups. It will take a large number of such explorations to increase the confidence of investigators. Thus, every study contributing to this question is important.
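The dependence of p values on sample size can be illustrated with a minimal sketch. It assumes a two-sided z-test for two equal-sized groups with a common, known SD; the function name and all numbers (a 2-point difference, an SD of 20) are hypothetical and are not drawn from the trials analyzed here.

```python
import math

def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p value for a difference in means between two equal-sized
    groups, using a normal approximation with a common, known SD."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))

# Hypothetical: the same 2-point difference on a 0-100 scale, SD = 20
p_small = two_sample_z_p_value(2.0, 20.0, 50)
p_large = two_sample_z_p_value(2.0, 20.0, 5000)
print(f"n = 50 per group:   p = {p_small:.3f}")
print(f"n = 5000 per group: p = {p_large:.2g}")
```

The same small difference is far from significant in the small sample and highly significant in the large one, although its clinical meaning has not changed.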

The European Organization for Research and Treatment of Cancer Quality of Life Questionnaire core 30 (EORTC QLQ-C30) assesses HRQOL in cancer patients with 15 scales, each ranging in scores from 0 to 100. Anchor-based methods have been used previously to inform the interpretation of QLQ-C30 scores [14, 16]. Using global ratings of change as the anchor, Osoba et al. [14] suggested that in patients with breast and small-cell lung cancers, changes in scores of 5–10 represented a small difference; 10–20 represented a moderate difference, while those above 20 represented large differences. Using a variety of clinical classifications as anchors, King [16] obtained similar results when she collated results from various studies and various cancer sites. Based on these two studies, mean differences of 10 points or more are widely viewed as being clinically significant when interpreting the results of randomized clinical trials that use the QLQ-C30 [17]. However, the evidence is not clear that a 10-point threshold is applicable to each of the 15 QLQ-C30 scales [17]. Further, it has not yet been established whether the same thresholds apply to improvement and deterioration in HRQOL scores. Additional empirical investigation of the size and patterns of MIDs across domains of the QLQ-C30 is therefore justified.

Our focus was to determine the change in selected QLQ-C30 scales which corresponds to the MID for improvement and deterioration in HRQOL for non-small-cell lung cancer (NSCLC) patients. Identification of MIDs was carried out using two clinical anchors: change in physician-rated WHO PS and weight change. Since no MIDs on the QLQ-C30 have been determined for NSCLC patients and since MIDs may vary across patient groups, we focused on NSCLC with the intention of analyzing other sites later.

Patients and methods

The EORTC QLQ-C30

The QLQ-C30 contains both single- and multi-item scales. Of the 30 items, 24 aggregate into nine multi-item scales representing various HRQOL dimensions: five functioning scales (physical, role, emotional, cognitive, and social), three symptom scales (fatigue, pain, and nausea and vomiting), and one global measure of health status. The remaining six items are single-item scales: five assess symptoms (dyspnea, appetite loss, sleep disturbance, constipation, and diarrhea) and one assesses the perceived financial impact of the disease and its treatment. High scores indicate better HRQOL on the global health status and functioning scales, but worse symptoms on the symptom scales.

Description of the data and selection of QLQ-C30 scales

Two closed EORTC randomized controlled trials enrolling in total 812 palliative, locally advanced, and/or metastatic NSCLC patients were jointly analyzed. Trial 1 compared gemcitabine + cisplatin and paclitaxel + gemcitabine to the standard arm paclitaxel + cisplatin, enrolled 480 patients, and used the QLQ-C30 version 3 [18]. Trial 2 compared two cisplatin-based combination chemotherapies, enrolled 332 patients, and used QLQ-C30 version 1 [19]. These two versions of the QLQ-C30 differ only in the response options for the items in the physical and role functioning domains. Version 1 uses a binary (no, yes) scale and version 3 uses a four-point scale ranging from “not at all” to “very much” [20]. In both trials, HRQOL was measured as a secondary endpoint at baseline, during treatment, and on several follow-up occasions after the end of treatment.

Physical (PF), role (RF), and social (SF) functioning, global health status (GHS), fatigue (FA) and pain (PA) were chosen for this analysis because they were expected to show relatively strong association with the chosen anchors. Indeed, correlations between the other scales and the anchors were relatively weak (data not shown).

Both trials involved the same cancer site and had similar treatment modalities; thus, data from these trials were pooled. Due to the mentioned differences in the versions of the QLQ-C30, analysis for PF and RF was restricted to trial 1 which used version 3, the current version [20].

Clinical anchors

The anchor-based approach to developing MIDs requires an anchor that is itself interpretable and at least moderately correlated with the instrument being explored [2]. The chosen clinical anchors are clearly definable and understandable, they are commonly used by clinicians in assessment of cancer patients, and they have previously been shown to be correlated with HRQOL assessments of cancer patients [16, 21, 22]. Values for the WHO PS range from 0 (no symptoms of cancer) to 4 (bedbound). Changes in PS were categorized into three groups: deterioration (PS worsened by one category), no change (PS stayed the same), and improvement (PS improved by one category). Following CTCAE [23] guidelines, changes in weight were grouped as weight loss (5 to <20% loss), no change (<5% loss or gain of total body weight), and weight gain (5 to <20% gain). Patients whose PS changed by two or more categories or whose body weight changed by ≥20% (conventionally classified as severe loss [23]) were excluded, since such changes were considered to be more than “minimal” in terms of their clinical relevance in this patient population.
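The anchor categories defined above can be sketched in code as follows. The function names and the use of None to mark excluded patients are assumptions made for illustration; only the cut-offs come from the text.

```python
def classify_ps_change(ps_t1, ps_t2):
    """Classify the change in WHO PS (0 = best, 4 = worst) between two visits.
    Returns None when PS changed by two or more categories (excluded)."""
    delta = ps_t2 - ps_t1
    if delta == 0:
        return "no change"
    if delta == 1:
        return "deterioration"  # a higher PS value means worse status
    if delta == -1:
        return "improvement"
    return None

def classify_weight_change(weight_t1, weight_t2):
    """Classify percentage change in body weight relative to the first visit.
    Returns None when weight changed by 20% or more (excluded)."""
    pct = 100.0 * (weight_t2 - weight_t1) / weight_t1
    if abs(pct) >= 20.0:
        return None
    if pct <= -5.0:
        return "weight loss"   # 5 to <20% loss
    if pct >= 5.0:
        return "weight gain"   # 5 to <20% gain
    return "no change"         # <5% loss or gain

print(classify_ps_change(1, 2))            # deterioration
print(classify_weight_change(80.0, 72.0))  # weight loss (10% loss)
```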

Data analysis

Our focus was on changes in individual HRQOL scores of patients over time. Separate analyses were conducted using anchoring by PS and by weight change, respectively. For each analysis, patients with data on the anchors and HRQOL scores at 2 or more time points were included. The points furthest apart in time, denoted T1 and T2, provided a better chance of observing changes in HRQOL scores and were therefore used for analysis.
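The selection of T1 and T2 can be sketched as below. The record layout (dictionaries with hypothetical keys 'day', 'anchor', and 'hrqol', with None marking a missing value) is an assumption for illustration and does not describe the actual trial database.

```python
def pick_t1_t2(assessments):
    """Return the earliest and latest assessments that have both an anchor
    value and an HRQOL score, or None if fewer than two distinct days qualify."""
    complete = [a for a in assessments
                if a["anchor"] is not None and a["hrqol"] is not None]
    complete.sort(key=lambda a: a["day"])
    if len(complete) < 2 or complete[0]["day"] == complete[-1]["day"]:
        return None
    return complete[0], complete[-1]

# Hypothetical patient with one incomplete assessment (missing anchor on day 90)
patient = [
    {"day": 0, "anchor": 1, "hrqol": 66.7},
    {"day": 42, "anchor": 1, "hrqol": 58.3},
    {"day": 84, "anchor": 2, "hrqol": 50.0},
    {"day": 90, "anchor": None, "hrqol": 41.7},
]
t1, t2 = pick_t1_t2(patient)
print(t1["day"], t2["day"])  # 0 84
```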

Differences in the anchor values and HRQOL scores between T1 and T2 were calculated for each patient. Eleven patients who deteriorated by more than one PS category were excluded from the PS analysis. No patients improved by more than one PS category. Three patients who lost ≥20% total body weight were excluded from the weight loss analysis. No patients gained ≥20% of their weight.

The differences in individual patient’s HRQOL scores were then assigned to one of three “clinically meaningful” categories, as defined a priori by the anchors, e.g., “improvement”, “no change”, and “deterioration” groups for PS (see “Clinical anchors”). We obtained estimates of the MIDs by calculating the difference in mean HRQOL change between adjacent categories [13], i.e., “improvement” versus “no change” and “no change” versus “deterioration”. This was done to control for the amount of change in HRQOL that occurred to patients who did not change according to the anchor. The 95% confidence intervals (CI) for the differences in mean of change scores were calculated.
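A minimal sketch of the adjacent-category calculation is given below. The text does not state how the confidence intervals were computed, so the sketch assumes an unpooled (Welch-type) standard error with a normal critical value of 1.96; the function names and the change scores are synthetic and serve only as illustration.

```python
import math
import statistics

def mean_diff_with_ci(group_a, group_b, z=1.96):
    """Difference in mean change (group_a minus group_b) with an approximate
    95% CI. Assumption: unpooled standard error, normal critical value."""
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    return diff, diff - z * se, diff + z * se

def mid_estimates(changes_by_group):
    """MID estimates as differences in mean change between adjacent categories."""
    improved = changes_by_group["improvement"]
    stable = changes_by_group["no change"]
    worsened = changes_by_group["deterioration"]
    return {
        "improvement": mean_diff_with_ci(improved, stable),
        "deterioration": mean_diff_with_ci(stable, worsened),
    }

# Synthetic change scores (T2 minus T1) on a functioning scale
example = {
    "improvement": [10.0, 0.0, 5.0, 5.0],
    "no change": [-5.0, 0.0, -10.0, -5.0],
    "deterioration": [-10.0, -15.0, -5.0, -10.0],
}
result = mid_estimates(example)
for label, (diff, low, high) in result.items():
    print(f"{label}: {diff:.1f} (95% CI {low:.1f} to {high:.1f})")
```

Subtracting the mean change of the “no change” group is what removes the background drift in HRQOL among patients whose anchor value stayed the same.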

The association between HRQOL scores and anchor values, and between changes in both the anchor and HRQOL scale, was quantified by the Spearman rank correlation coefficient. Revicki et al. [24] suggested a correlation of at least 0.30 as a measure of an acceptable association.
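For illustration, the Spearman coefficient can be computed as the Pearson correlation of average ranks, as in the sketch below. The function names are hypothetical, the data are synthetic, and applying the 0.30 threshold to the absolute value of the coefficient is an assumption made here.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def acceptable_anchor(rho, threshold=0.30):
    """Assumption: the 0.30 threshold is applied to the absolute correlation."""
    return abs(rho) >= threshold

# Synthetic data: worse PS (higher value) paired with lower functioning scores
ps = [0, 1, 1, 2, 2, 3]
functioning = [90.0, 80.0, 60.0, 70.0, 50.0, 40.0]
rho = spearman(ps, functioning)
print(round(rho, 2), acceptable_anchor(rho))
```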

For comparison purposes, three distribution-based approaches were applied: 0.5 SD, 0.20 SD, and the SEM. The SEM measures the precision of the HRQOL instrument [8]. We calculated SEM using SD at T1 and T2, separately, and test–retest reliability estimates provided by Hjermstad et al. [25]. Our results were also compared with the 5–10% range of the instrument [15].
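The distribution-based thresholds can be sketched as follows. The sketch uses the standard formula SEM = SD × sqrt(1 − reliability); the function name and the input values (an SD of 20 and a reliability of 0.84) are hypothetical and are not the estimates used in this study.

```python
import math

def distribution_based_mids(sd, reliability, scale_min=0.0, scale_max=100.0):
    """Distribution-based MID candidates for one scale.
    sd: between-person SD of scores at one time point.
    reliability: test-retest reliability coefficient (between 0 and 1)."""
    scale_range = scale_max - scale_min
    return {
        "0.2 SD": 0.2 * sd,
        "0.5 SD": 0.5 * sd,
        "1 SEM": sd * math.sqrt(1.0 - reliability),
        "5% of range": 0.05 * scale_range,
        "10% of range": 0.10 * scale_range,
    }

# Hypothetical inputs for a scale scored from 0 to 100
mids = distribution_based_mids(sd=20.0, reliability=0.84)
for name, value in mids.items():
    print(f"{name}: {value:.1f}")
```

With these illustrative inputs, 1 SEM (8 points) falls between 0.2 SD (4 points) and 0.5 SD (10 points); the ordering in practice depends on the reliability of the scale.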

Results

Table 1 gives a summary of selected demographic and clinical characteristics of the patients at baseline in the combined data from the two trials.

Table 1 Selected baseline demographic and clinical characteristics of the patients

Descriptive statistics summarizing the distributions of HRQOL scores at baseline are given in Table 2. The distributions for PF, SF, RF, PA, and FA were skewed, with a predominance of good functioning and low symptoms, while GHS was reasonably symmetrical. The mean and standard deviations of HRQOL scores at the two time points T1 and T2 are also given in Table 2.

Table 2 Summary statistics for HRQOL scales at baseline and at T1 and T2, cross-sectional correlation estimates for HRQOL scores with PS, and correlations between HRQOL change scores with changes in both anchors

As shown in Table 2, the cross-sectional correlations of HRQOL measures with PS were generally moderate, ranging in absolute value from 0.30 to 0.44. Correlations for GHS and SF at T1 (−0.29, −0.23) and PA at T2 (0.24) were relatively weak. Except for appetite loss, cross-sectional correlations with PS at both T1 and T2 were weaker for all other scales of the QLQ-C30 than for the scales we chose (data not shown). For changes in HRQOL scores and changes in both anchors, the correlations were generally weak (ranging 0.03–0.21 in absolute value).

When anchoring with PS, the number of days between T1 and T2 ranged from 20 to 161 with a mean of 76 (SD = 34.2) for trial 1. For trial 2, the number of days ranged from 20 to 194 with mean of 88 (SD = 37.9). A very similar distribution for the time separation was observed when anchoring with weight change. Including the number of days between T1 and T2 as a covariate in a regression model that related changes in HRQOL scores to changes in the anchor showed no statistically significant effect (p > 0.05) of time separation on changes in HRQOL scores for each of the scales analyzed. Further, addition of “study effect” to the regression model showed no statistically significant differences in change scores between the two trials, supporting the idea of combining the two trials.

The mean change scores for the selected QLQ-C30 domains and corresponding differences between adjacent categories are presented in Tables 3 and 4, anchored by PS and weight change, respectively.

Table 3 Performance status—mean (SD) of HRQOL change scores in the three anchor-defined groups and the difference in mean change scores (95% CI) between adjacent categories
Table 4 Weight change—mean (SD) of HRQOL change scores in the three anchor-defined groups and the difference in mean change scores (95% CI) between adjacent groups

As an illustration, in Table 3, the first difference in PF mean change between adjacent categories is obtained as 3.6 − (−5.3) = 8.9 PF units, and the second as −5.3 − (−9.7) = 4.4 PF units; these provide the MID estimates for improvement and deterioration, respectively. The estimates for weight change were obtained in the same way. For the PS results, the 95% CI for the difference in mean of change scores did not include zero, suggesting statistically significant differences between the “improvement” and “no change” groups for all scales except for SF. For “no change”–“deterioration” comparisons, only SF showed a statistically significant difference. No statistically significant difference was observed between the “weight gain” and “no change” groups for any scale, while for “no change” versus “weight loss”, all scales except RF and GHS showed statistically significant differences.

Table 5 displays the anchor-based MID estimates adjacent to the distribution-based MID estimates. Since the SEM, 0.5 SD and 0.2 SD estimates at T1 and T2 were very similar, and not systematically different across the different scales and across the anchors, only results at T1 based on PS were reported.

Table 5 Anchor-based MID estimates compared with distribution-based MID estimates

Discussion

The aim of our study was to determine the magnitude of difference in scores in selected EORTC QLQ-C30 scales that represents the MID in palliatively treated NSCLC patients. Our approach was to link changes in HRQOL scores to groups known to have changed in terms of clinically relevant anchors, in this case, PS and weight change. In general, the mean changes in HRQOL within each anchor-defined group were in the expected direction.

It is notable that, although the results are not definitive, the MID estimates differed somewhat in size across scales and that, when anchoring with PS, MIDs for improvement tended to be larger than MIDs for deterioration. The former tended to be closer to the SEM, while the latter tended to be closer to the 0.2 SD estimates. This provides further evidence that 0.5 SD may represent a “medium” effect size [7], whereas 1 SEM may approximate a threshold for defining the MID [8]. In line with our results, Samsa et al. [9] suggest that 0.2 SD may provide a better estimate of the MID than 0.5 SD.

The suggestion that a larger degree of change may be required to be meaningful when a patient is improving than when a patient is worsening contrasts with a number of studies that have reported higher MID estimates for deterioration than for improvement [14, 15, 26]. One possible explanation of our findings is that physicians may have misclassified the PS of patients, particularly those whom they thought had stable PS. Our results suggest that patients classified as having not changed in PS had actually deteriorated in HRQOL scores. However, there is no valid a priori reason to suggest that there are differences in the way physicians assessed PS in our study relative to other studies. If a subconscious bias exists that makes it more likely for physicians to report PS as stable or worsening, rather than improving, then a larger MID for improvement would be found by our anchoring method. This is consistent with optimism bias, the same cognitive bias which may lead to the opposite result (a larger MID for deterioration) when patient ratings are used to anchor the MID [27]. Our results are supported by the relatively large sample size available in this study, but further investigations in other cancer sites are required to confirm our results.

It is also possible that the differences in MIDs for improvement and deterioration were due merely to sampling variation. Our findings are based on the largest samples in the literature to date for determining MIDs from HRQOL change scores and particularly for considering improvement versus deterioration. Nevertheless, when the 95% CIs are taken into account, there is considerable overlap in the MIDs for all scales for both improvement and deterioration. Thus, further studies of large samples of patients with cancers in other sites are warranted.

Given the relatively weak correlations observed, we acknowledge that our anchors did not appear to work well for some of the scales, e.g., SF and PA. It may be argued that such anchors should not be used for such scales. Other studies using other anchors [14, 15] have also found only moderately strong correlations of the anchors with the HRQOL scores; the reasons are unknown. For interpretation, our anchor-based MID estimates could be augmented with results from one of the distribution-based approaches by considering only those anchor-based MID estimates (see Table 5) that are at least equal to 0.2 SD [13], which is a “small effect” [7].

The clinical significance of weight gain/loss is not well established. While weight gain in some patients is a positive sign of improving physical condition, in some patients weight gain can be due to a buildup of ascites, which may in fact reflect increasing cancer activity. Similarly, weight loss may paradoxically reflect health improvement, for instance, discrete ascites or oedema that improved with treatment. This complicates the relationship between weight change and the changes in HRQOL scores. Therefore, results for scales (e.g., pain) exhibiting weak correlations with this anchor should be interpreted with caution.

The correlations between changes in either anchor (PS or weight) and HRQOL were not strong. The functional relationship between these changes is unknown and can be complex. Further, correlation based on changes in scores and anchor is likely to be smaller than cross-sectional correlations due to the measurement error in both anchor and HRQOL measures at both time points.

The changes that we are calling MIDs are based on the definitions and clinical anchors that we have applied, i.e., they are not “absolute” but are “relative” to the clinical anchors we used, as is always the case for MIDs based on the anchor-based approach. Indeed, issues about whether changes in HRQOL scores corresponding to changes in these anchors represent the MID, rather than just an important, anchor-dependent difference can be raised. This is an important issue which requires further research.

There is limited information about whether the value of the MID is stable across the continuum of illness. For example, can a change in PS from 0 to 1 be considered the same as a change from 3 to 4 when correlating with changes in HRQOL scores? Patients had to have a fairly good PS to enter our trials (see Table 1); therefore, we did not have the data to address this important issue.

The patients in our data showed a predominance of good functioning and low symptoms at baseline. Since MIDs can vary across the spectrum of the EORTC QLQ-C30 scores, it is conceivable that our results would have been different had our sample consisted predominantly of patients with poorer functioning and more symptoms. Being retrospective in nature, our analysis was restricted to using only WHO PS and weight change, which were deemed credible anchors for the selected scales. Had other credible anchors been available, we could have used them instead of, or together with, the anchors we considered. However, using different anchors or anchor types (e.g., subjective vs. objective or prospective vs. retrospective) can also lead to different conclusions regarding estimates of the MID [8].

In conclusion, our findings provide estimates of MIDs for NSCLC patients in a selected subset of the EORTC QLQ-C30 scales. Differences between the MIDs for improvement and deterioration were observed, although they were not definitive. For the scales we tested, our MID estimates generally agree with the 5 to 10 units proposed by Osoba et al. [14] and King [16]. We suggest that they may be used as guidance for clinicians and researchers to classify patients as improved or deteriorated in HRQOL and symptoms over time, and thence to determine the proportion of patients benefiting from treatment. They can also be used for sample size determination in the design of future clinical trials.