Introduction

Randomized controlled trials (RCTs) inform and influence the practice of medicine. RCTs and systematic reviews are increasing in number, quality, and impact. Nevertheless, applicability to decision-making at the individual patient level remains limited. One of the greatest challenges is the assessment of clinical significance. While some outcomes are intuitively meaningful, such as death or hospitalization, most benefits come in a range of magnitudes, from trivial to truly important. For instance, reduction in pain, increased exercise tolerance, or improved mood or outlook on life can be so small as to be barely detectable, or so large as to constitute a major change in health status. Even significant events (e.g., heart attack or stroke) come in a range of magnitudes of severity and importance.

Health-related quality-of-life questionnaire instruments (HRQoL) are designed to measure symptomatic and functional outcomes for both acute and chronic illness [1]. HRQoLs can be classified as predictive, discriminative, or evaluative, according to their purpose and use [2]. Predictive instruments are designed to predict subsequent events. Discriminative instruments help distinguish, classify, or diagnose. Evaluative instruments are “used to measure the magnitude of longitudinal change in an individual or group on the dimension of interest” [2]. Representation of domains important to patients and clinicians (content validity), reliability (proportion of measurement not due to chance or error), and responsiveness (sensitivity to change over time) remain the key parameters used to assess HRQoL instruments.

Recognizing that very small, imperceptible, and/or clinically irrelevant changes can be demonstrated by sensitive instruments in large trials, and that small but clinically important changes can be missed by insensitive instruments and/or small studies, Jaeschke, Singer, and Guyatt introduced the concept of “minimal clinically important difference” in 1989, defining it as “the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troubling side effects and excessive cost, a change in the patient’s management” [3]. Since then, dozens of studies have sought to assess “minimal important difference” (MID) for a variety of conditions, using a variety of methods [4–9]. The MID concept has special appeal, as it corresponds to the magnitude of benefit for which an RCT should be powered in order to minimize risks of false positive and false negative trials. Additionally, MID may serve as a benchmark of clinical significance when interpreting trial results, creating health policy, or making treatment decisions.

Two general approaches are used to estimate small-but-important-difference: anchor-based and distribution-based [10–17], also described as external and internal [18]. The most widely used anchor-based approach was developed by Guyatt et al. [19], Jaeschke et al. [3], and Juniper et al. [20, 21] and compares interval changes in HRQoL scores to global rating of change (GRC) scores. The magnitude of interval change corresponding to a self-assessed GRC of “a little better” or “somewhat better” is interpreted as the standard MID. Another patient-centered method pioneered by Wells et al. [22] and Redelmeier et al. [5] allows patients diagnosed with the same condition to meet and discuss symptoms and functional impairments, then compare their health status with their conversational partner(s). Alternatively, physician assessments [9, 23] or other external measures [24–26] can be used as anchors. In order to control for possible confounders, and to take account of within-person-over-time dependencies, formal regression-based statistical methods have been proposed [11], but are not widely implemented.

Distribution-based assessments take a variety of formats, but are all based on either within-group change-over-time or between-group comparison [27–29]. Comparisons may include: (1) before versus after treatment, (2) before vs. after natural change, or (3) between treatment and control groups at one or more points in time. Observed absolute differences are often standardized by dividing by the standard deviation, yielding the standardized effect size (ES). Varying levels of clinical significance have been attributed to different ES ranges. In an influential 1969 text, Cohen designated an ES of up to 0.2 as “small,” 0.5 as “medium,” and 0.8 or more as “large” [30]. This is problematic in that population variability, conceptually independent from clinical significance, strongly influences ES estimation. Nevertheless, the ES method has appeal, as it is time-tested, widely understood, and central to many psychometric indices [31].
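As a concrete illustration of the ES calculation and Cohen’s labels, the following minimal sketch (Python, with hypothetical change scores not drawn from this study) computes a between-group standardized effect size:

```python
import numpy as np

def standardized_es(treated, control):
    """Mean difference divided by the pooled standard deviation (Cohen's d)."""
    t, c = np.asarray(treated, float), np.asarray(control, float)
    n1, n2 = len(t), len(c)
    pooled_var = ((n1 - 1) * t.var(ddof=1) + (n2 - 1) * c.var(ddof=1)) / (n1 + n2 - 2)
    return (t.mean() - c.mean()) / np.sqrt(pooled_var)

# Hypothetical HRQoL change scores for two trial arms
es = standardized_es([12, 9, 15, 7, 11], [6, 8, 4, 9, 5])
# Cohen's conventional (and arbitrary) labels, used here for orientation only
label = "small" if abs(es) < 0.5 else "medium" if abs(es) < 0.8 else "large"
print(round(es, 2), label)
```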

The standard error of measurement (SEM) has been proposed as a measure for powering and interpreting RCTs. The SEM is a theoretically fixed psychometric property that expresses magnitude in the same units as the original measure. SEM is defined as $\sigma_x(1-r_{xx})^{1/2}$, and thus incorporates estimates of both reliability ($r_{xx}$) and variability ($\sigma_x$). Wyrwich and coauthors investigated the relationship between SEM and MID, reporting substantial consistencies between these theoretically distinct entities [29, 32, 33]. Across a reasonably wide range of chronic conditions, magnitudes of SEM and MID appeared similar, hence the recommendation that one SEM could be used as a benchmark when investigators want a distributional approach consistent with patient-centered anchor-based methods. In a similar vein, Norman, Sloan, and Wyrwich reviewed 38 studies suitable for comparing MID to standardized ES, found “remarkable uniformity,” and concluded that “the threshold of discrimination for changes in health-related quality of life for chronic diseases appears to be approximately half a SD.” They found only two suitable studies of acute conditions, where “meaningful change” corresponded to .80 and 1.38 SD, a “somewhat larger magnitude of change than those in the other [chronic] studies considered.” This finding agrees with our impression that very few data are available regarding assessment of important difference and responsiveness for acute conditions.
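To make the formula concrete, a worked example with purely illustrative numbers (not taken from any of the cited studies):

$$ \mathrm{SEM} = \sigma_x \sqrt{1 - r_{xx}}\,;\qquad \text{for } \sigma_x = 50 \text{ and } r_{xx} = 0.90,\quad \mathrm{SEM} = 50\sqrt{0.10} \approx 15.8. $$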

The purpose of the current paper is to compare distributional and anchor-based approaches to assessment of important difference and responsiveness in acute upper respiratory infection (common cold). We do this using a data set generated by people with colds who self-reported symptoms on a validated questionnaire instrument once daily throughout the length of their illness.

Methods

Instrument

The Wisconsin Upper Respiratory Symptom Survey (WURSS) was developed as an evaluative illness-specific quality-of-life instrument [34]. A predecessor questionnaire (15 items, 9-point response range) was used in a randomized trial testing echinacea as treatment for common cold [35]. After that trial, using mixed qualitative and quantitative methods, the questionnaire was developed into the 44-item WURSS-44 instrument used in the present study. Development methods included open-ended elicitation of symptoms and dysfunctions from people with colds, followed by iterative assessment of content validity and ease-of-use assessed with in-person interview and focus group methods [34].

The WURSS-44 includes 1 global health item, 32 items rating specific symptoms, 10 functional quality-of-life items, and 1 GRC question (Table 1). Using responsiveness and importance-to-patient as guides, we selected best items for a shorter version, the WURSS-21, which is now undergoing prospective validation. Items of the WURSS-21 are a subset of WURSS-44 items, and include 1 item rating global health, 10 items rating symptoms, 9 functional quality-of-life items, and 1 item rating global change. All use similar 7-point Likert-type scales. Both instruments are available free-of-charge for educational, public health, and nonprofit purposes: http://www.fammed.wisc.edu/wurss/. The construct validity of the WURSS-44 is supported by assessments of face validity, importance-to-patients, responsiveness, and dimensional cohesion [36]. External (convergent; concurrent) validity was assessed through comparisons to the Jackson scale [37], the SF-36 [38], and the SF-8 [39] (24 h recall version), and laboratory-assessed biomarkers including viral titer, nasal neutrophils, mucus weight, and interleukin-8 [40]. Development [34], and validation [36, 40] details are more fully described elsewhere.

Table 1 Content of the Wisconsin Upper Respiratory Symptom Survey

Participants: enrollment and monitoring

To be eligible for this study, adults had to answer “Yes” to either “Do you think that you have a cold?” or “Do you think that you are coming down with a cold?” Participants were required to have a Jackson score [37] of three or more, calculated by summing severity scores (1 = mild, 2 = moderate, 3 = severe) for eight symptoms: sneezing, runny nose, nasal congestion, sore throat, cough, headache, malaise, and chilliness. At least one of the first four (cold-specific) symptoms had to be present, and no symptom could have been present for more than 48 h. Allergy was excluded by asking about itchy eyes, sneezing, and previous allergy diagnosis. Callers were screened by phone, then met in person for informed consent and enrollment, following a protocol approved by the University of Wisconsin Institutional Review Board. Participants filled out the WURSS-44 at enrollment, then once each day until they indicated that they were “Not sick” for 2 days in a row. Telephone contact was attempted each day in order to enhance adherence to the protocol. Participants were met for an exit interview within a few days after their colds had ended.

Anchor-based methods

The anchor-based methods used here compare retrospective self-assessment of improvement or worsening to prospectively assessed WURSS interval changes. All comparisons are on consecutive days, approximately 24 h apart. The GRC item starts with the question, “Compared to yesterday, I feel that my cold is...,” with “Better,” “The Same,” and “Worse” as initial response options. Those who feel that their colds have improved (“better”) or deteriorated (“worse”) are then asked to rate the degree of change using the following format:

If Better:

1. Almost the same, hardly any better at all
2. A little better
3. Somewhat better
4. Moderately better
5. A good deal better
6. A great deal better
7. A very great deal better

If Worse:

1. Almost the same, hardly any worse at all
2. A little worse
3. Somewhat worse
4. Moderately worse
5. A good deal worse
6. A great deal worse
7. A very great deal worse

The originators of this standard MID method suggest that response options starting with “Almost the same...” be excluded from analysis, and that options “a little better” and “somewhat better” be lumped together and interpreted to correspond to “MID” [3, 19–21]. Options starting with “moderately” and “a good deal” are interpreted as “moderate important difference,” and options starting with “a great deal” and “a very great deal” are interpreted as “large important difference.”
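As a rough sketch of this grouping logic (Python, with hypothetical column names and made-up interval changes, not the study’s analysis code), consecutive-day score changes can be averaged within each GRC band:

```python
import pandas as pd

# Hypothetical data: one row per consecutive-day comparison
df = pd.DataFrame({
    "grc": ["a little better", "somewhat better", "moderately better",
            "a great deal better", "almost the same", "a little better"],
    "score_change": [-14, -20, -25, -18, -3, -12],  # day-to-day change in total score
})

# Map GRC responses to importance bands; "almost the same" responses are
# deliberately left unmapped and excluded, per the standard MID method.
grc_to_band = {
    "a little better": "minimal", "somewhat better": "minimal",
    "moderately better": "moderate", "a good deal better": "moderate",
    "a great deal better": "large", "a very great deal better": "large",
}
df["band"] = df["grc"].map(grc_to_band)
print(df.dropna(subset=["band"]).groupby("band")["score_change"].mean())
```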

In order to assess the relationship of GRC to WURSS scores across time, across individuals, and across the entire spectrum of the GRC scale, we selected a multi-level, mixed-effect, multivariate regression approach. Drawing on the work of Brant et al. [11] and Yang and Goldstein [41], we developed multi-level multivariate linear regression models, providing variance-covariance matrices at the fixed-occasion level (daily increments) and at the patient level, with initial intercepts modeled as random variables. Daily assessments are treated as repetitions at level 1 (indexed by t) nested under patients (indexed by i). Let $Z_t$ be a vector of indicator variables for t = 1, 2, 3, 4, 5, 6, and 7, corresponding to daily change from day 2–3, 3–4, 4–5, 5–6, 6–7, 7–8, and 8–9, respectively, and let y be the change in the WURSS score for each patient at each time period. The general model for our data may be written as:

$$ y_{ti} = \sum_{t=1}^{7} \beta_{0,t} Z_{ti} + \sum_{t=1}^{7} \sum_{h=1}^{H} \beta_{h,t} Z_{ti} x_{h,ti} + \sum_{t=1}^{7} \mu_t Z_{ti} + e_{ti}, \qquad y_{ti} \sim N(\mathrm{XB}, \Omega), $$

where $x_{h,ti}$ are covariates, $\mu_t \sim N(0,\Omega)$, and $e_{ti} \sim N(0,\Omega)$, where the $\Omega$ are the variance-covariance matrices for the respective levels. The model projects the amount of day-to-day WURSS interval change that corresponds to a single-point difference on the GRC scale. We chose a single-point GRC difference as the reference point for three reasons: (1) there is a long line of research suggesting that people can reliably discriminate at this level [1, 42–44], (2) several MID studies suggest that a difference of .5–1.0 points on a 7-point scale is significant [3, 5, 21], and (3) simplicity and interpretability.
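The model above was fit with multi-level regression software; as a rough, simplified analogue only (a random-intercept model in Python/statsmodels, with hypothetical column and file names, not the authors’ implementation or their exact covariance structure), one could write:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per patient per day-to-day interval,
# with columns patient_id, interval (1..7 for day 2-3 ... day 8-9),
# wurss_change (y), and grc (global rating of change).
df = pd.read_csv("wurss_intervals.csv")  # hypothetical file

# Random intercept per patient; fixed effects for interval and GRC rating.
# This simplifies the fixed-occasion variance-covariance structure described
# in the text to a single residual variance.
model = smf.mixedlm("wurss_change ~ C(interval) + grc", data=df,
                    groups=df["patient_id"])
result = model.fit()
print(result.summary())

# The coefficient on grc projects the WURSS interval change associated with
# a one-point difference on the GRC scale.
```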

Distribution-based methods

Calculation of mean differences divided by pooled standard deviations is a straightforward, time-honored, and easily understandable process for most researchers and many clinicians. For this analysis, we base our standardized ES calculations on interval changes corresponding to the passage of approximately 24 h. We do this for three reasons: (1) there are no proven effective treatments for the common cold, hence no intervention-based standards are available for comparison; (2) one day (about 24 h) is the interval between the WURSS assessments used for the GRC anchor-based methods described above; and (3) qualitative evidence from our experience with hundreds of cold-sufferers suggests that the average natural daily improvement in a cold is a small but meaningful change.
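A minimal sketch of this within-person, consecutive-day calculation (Python, with hypothetical scores; the two-day pooled SD mirrors the approach reported alongside Table 3):

```python
import numpy as np

def daily_effect_size(day_a_scores, day_b_scores):
    """Mean 24-h change divided by the pooled SD of the two adjacent days."""
    a = np.asarray(day_a_scores, float)
    b = np.asarray(day_b_scores, float)
    mean_change = (b - a).mean()
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return mean_change / pooled_sd

# Hypothetical WURSS-44 totals for the same participants on two consecutive days
day3 = [95, 80, 110, 60, 72]
day4 = [88, 70, 101, 55, 66]
print(round(daily_effect_size(day3, day4), 2))  # negative value = improvement
```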

Although the definition of the SEM as $\sigma_x(1-r_{xx})^{1/2}$ is relatively well accepted, several distinct methods of estimating reliability ($r_{xx}$) and variability ($\sigma_x$) are available. For ease of interpretation and consistency with the published literature, we selected Cronbach’s α as the measure of reliability and the standard deviation as the measure of variability when calculating SEM. Following Wyrwich et al. [29, 32, 33], we used a change of one SEM for comparative estimation of clinically important difference. For all methods, calculation of sample size assumed a two-tailed comparison with a Type I error rate of α = .05 and a Type II error rate of β = .20 (80% power).
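A minimal sketch of these calculations (Python, with simulated item responses; the sample-size step uses the usual two-sample normal approximation, so exact figures may differ slightly from those produced by other software or assumptions):

```python
import numpy as np
from scipy.stats import norm

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = questionnaire items."""
    items = np.asarray(items, float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def sem(total_sd, reliability):
    """Standard error of measurement: sigma_x * sqrt(1 - r_xx)."""
    return total_sd * np.sqrt(1 - reliability)

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Two-tailed, two-sample sample size per arm (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil(2 * (z / effect_size) ** 2))

# Simulated 0-7 responses for 150 respondents on 44 items, driven by a latent severity
rng = np.random.default_rng(0)
latent = rng.normal(3.5, 1.5, size=(150, 1))
items = np.clip(latent + rng.normal(0, 1.0, size=(150, 44)), 0, 7)

alpha_hat = cronbach_alpha(items)
total_sd = items.sum(axis=1).std(ddof=1)
sem_hat = sem(total_sd, alpha_hat)
print(alpha_hat, sem_hat, n_per_group(sem_hat / total_sd))
```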

Results

Recruitment of study participants ran from March 2002 to August 2003, during which time 737 callers responded to community advertising in Madison, Wisconsin. Of these, 167 passed telephone screening and were deemed eligible for enrollment, and 157 were met in person for informed consent and enrollment. Participants then filled out questionnaires once each day of illness, to a maximum of 14 days, then met study personnel for an exit interview. In total, 150 participants were followed throughout their colds, documenting 1,681 days of symptoms. Excluding days when they said they were no longer sick, and including the estimated time from first symptom until enrollment, the mean duration of illness was 9.1 days (SD = 3.7). Of the 7 participants lost to follow-up, six could not be contacted after the initial interview despite multiple attempts. The seventh was unable to return from travel to another state, but did send back paperwork documenting the first three days of her illness. Characteristics of the 150 participants completing the protocol are shown in Table 2.

Table 2 Participant characteristics

The mean total WURSS-44 score was 91.3 at enrollment, with a SD of 48.5 and an inter-quartile range of 54–125. The mean total score increased slightly to 93.5 on day 2, then dropped by an average of 8.2 points per day over the next 6 days (range = 6.5–10.0 point decrease per day). The downward trend continued, albeit more slowly, after day 7, with scores stabilizing in the 40s for days 10–14 (for participants whose colds lasted that long). Dividing interval changes from days 2 to 7 by two-day pooled SDs yielded ESs of .13–.19 (mean = .16), small by Cohen’s arbitrary standard [45], but perhaps important if one accepts the argument that one day’s average natural improvement in common cold illness is meaningful. See Table 3.

Table 3 Between-day responsiveness using distributional and anchor-based methods

We analyzed a total of 704 consecutive between-day comparisons in which participants rated themselves as “better.” On 335 between-day comparisons when participants said they were “a little” or “somewhat” better, WURSS-44 scores decreased by an average of 16.7 points (95% CI = 14.5, 18.9). This represents the standard MID, using the GRC-based method described above. For the 208 between-day comparisons in which participants rated themselves as “moderately” or “a good deal” better (“moderate important difference”), WURSS-44 scores improved by an average of 23.1 points (95% CI = 20.1, 26.2). For the 161 between-day comparisons rated as “a great deal” or “a very great deal” better (“large important difference”), WURSS-44 scores improved by an average of only 15.6 points. Possible reasons for this discrepant finding are provided in the Discussion section.

As colds tend to improve over time, there were fewer between-day comparisons in which participants rated themselves as “worse” (N = 246). On 139 between-day comparisons when participants said they were “a little” or “somewhat” worse, WURSS-44 scores increased by an average of 11.1 points (95% CI = 7.5, 14.6). For the 83 between-day comparisons in which participants rated themselves as “moderately” or “a good deal” worse, WURSS-44 scores deteriorated by an average of 20.5 points (0.48 points per item; 95% CI = 14.5, 26.6). For the 24 between-day comparisons rated as “a great deal” or “a very great deal” worse, WURSS-44 scores increased by an average of 33.6 points (0.78 points per item; 95% CI = 16.9, 50.2).

Standard MIDs were calculated for all between-day comparisons from day 2 to day 9 (Table 3). The interval from day 1 to day 2 was excluded because participants’ symptoms tended to worsen slightly during the first 24 h after enrollment. We excluded data from after day 9 because of small sample size, the reduced rate of symptom severity decline, and concern that recovery might be introducing recall bias into GRC ratings. Day-to-day standard MIDs from day 2 to 7 ranged from 16.5 to 25.5 (mean 22.0). The standard MID for the day 8 to day 9 interval was 8.5, somewhat lower than for previous intervals. Given these factors, our judgment is that the time range from day 2 to day 7 was best for estimating MID and responsiveness for this data set. Dividing the mean standard MID of 22.0 for days 2 to 7 by the pooled SD of 51.7 for this period yields a standardized coefficient of .426.

The two-level regression models described above were used to project the magnitude of interval change on the WURSS instrument corresponding to single-point differences on the GRC scale. A series of models sequentially adding and removing variables demonstrated that the covariates age, gender, education, ethnicity, smoking status, and Jackson score did not significantly affect the projections. Hence, these variables were not included in the final models, which yielded the indicators shown in Table 3. These models showed that single-point differences on the GRC scale corresponded to interval changes of 8.4–12.4 points (mean = 11.3) on the WURSS-44 over days 2–7.

The model-based estimate of 11.3 points yields a standardized coefficient of .218, implying that some 302 participants would be needed in each arm of a trial (assuming two-tailed testing, α = .05, and β = .20). For the standard MID estimate of 22.0 points (coefficient = .426), only 89 people would be needed in each group, using the same assumptions.

The GRC item is worth focused attention, as it is the anchor for both standard and regression-based MID methods. We present its over-time distribution in Fig. 1. At enrollment on day 1, the majority of people felt that they had worsened, with a substantial number describing themselves as “the same.” (Retrospective assessment at enrollment indicated that the mean time from first symptom to enrollment was 30.9 h.) On day 2, slightly fewer people, but still a majority, indicated worsening, with a few more in the “the same” and “better” groups. By day 3 and subsequently, a gradually increasing number of participants rated themselves as improved, with the sample size gradually declining as participants who declared themselves “not sick” exited the study. These results are consistent with previous research and theory [46] and, considered along with the expected associations of GRC data with WURSS scores, serve to support the validity of the GRC item.

Fig. 1 Distribution of GRC scores by day of illness

Distribution-based ES methods provide the smallest estimates of small-but-important change. Effect sizes for between-day comparisons for days 2–9 are shown in Table 3, and range from 0.06 to 0.19 for the WURSS-44 and from 0.07 to 0.20 for the WURSS-21. Averaging ES over days 2–7 yields a coefficient of 0.157 for the WURSS-44, implying that a two-armed trial would require 870 participants per group, again assuming two-tailed testing with α = .05 and β = .20. For the WURSS-21, the average ES coefficient was .162, which implies a needed sample size of 858 per group.

Accepting one SEM as an indicator of small-but-important change provides slightly larger standardized coefficients than the ES method, and thus slightly smaller required sample sizes to detect these changes. Cronbach’s α, within-day standard deviation, and the resulting SEMs for days 1–7 are portrayed in Table 4, along with standardized coefficients and sample sizes. The seven-day average standardized SEM is .177 for the WURSS-44 and .221 for the WURSS-21, suggesting that a trial would need group sizes of 616 and 520, respectively, to detect a difference of this magnitude.

Table 4 Reliability, variability, and standard error of measurement

It may be worth pointing out that the shorter instrument, the WURSS-21, appeared to be more responsive than the WURSS-44 in the original validation study [36]. Using the WURSS-21 estimates of small but important change of 12.7, 5.87, 5.99, and 4.53 points for standard MID, regression-based MID, SEM, and ES, respectively, treatment group sample sizes of 80, 306, 520, and 858 would be needed. The corresponding projected sample sizes for the WURSS-44 are 89, 302, 616, and 870. We consider these to be conservative estimates, as a well-designed RCT would look at outcomes over the full duration of illness rather than a single between-day comparison, which would presumably increase power to detect between-group differences.

Discussion

In 1957, Cronbach described “the two disciplines of scientific psychology” as “correlational” and “experimentalist” [47]. Correlational psychologists used multivariate and factor analysis techniques to investigate relationships within cross-sectional and prospective cohorts. Experimentalists used RCTs to isolate the effects of interventions, to establish causality, and to estimate the magnitude of change attributable to interventions. For correlationists, diversity was an ally, as it increased power to detect between-person relationships using within-person variables. For experimentalists, diversity was an obstacle. The greater the diversity across individuals, the more difficult it was to detect changes over time attributable to interventions. In this article and others [48, 49], Cronbach correctly pointed out that experimentalists could say little or nothing about the effects of interventions on individuals. While the average score of the treated group in an RCT might change by a significant amount compared to the control group, this could be due either to large changes in a few individuals or smaller changes across a greater number. For the clinician and the patient, the likelihood and magnitude of change for the individual was the key issue. Because individual trajectories could not be known a priori, neither correlational nor experimental science could accurately predict individual response to treatment.

In 1969, Cohen published “Statistical Power Analysis for the Behavioral Sciences” [45], which became a popular resource for investigators designing and interpreting RCTs. In that text, he proposed the standardized ES as an appropriate coefficient for assessing change over time (later termed responsiveness). Dividing the absolute difference (“raw gain” in the words of Cronbach) by the standard deviation yields the ES, which, when complemented by tolerances for type I and type II error, defines power and sample size requirements. As mentioned earlier, standardized ESs up to 0.2 were described as “small,” 0.5 as “medium,” and 0.8 as “large.”

The 1970s, 1980s, and 1990s saw the widespread introduction of HRQoL instruments as RCT outcome measures. This, along with a rising interest in the assessment of clinical significance, led to a variety of methods for assessing important difference and responsiveness. Working within the ability-to-detect-change framework, Deyo and colleagues compared HRQoL responsiveness to diagnostic test performance [12, 50]. Kazis followed a similar track, concentrating on ES [27]. Guyatt et al. [19], Jaeschke et al. [3], Juniper et al. [20, 21], Redelmeier et al. [5], and Wells et al. [22] adopted an entirely different approach, using patients’ value judgments to rate degree of change. Global ratings of change (GRC) were introduced, refined, and used as anchors for assessing interval changes. Perhaps due to their simplicity and face validity, these methods were picked up by several research groups, leading to a body of work that arguably defines the state of the art for assessing important difference and responsiveness [10–17]. Wyrwich has compared the resulting MIDs from these studies to corresponding SEM calculations, and has noted a consistent relationship [29, 32, 33].

As Norman, Stratford and Regehr [15] have noted, there are several potential problems with these approaches. The GRC item itself has not been extensively validated. As a single item, it cannot be assessed by reliability measures such as Cronbach’s α. While it could be assessed by test-retest methods, interpretation would be problematic due to the time-specific nature of the rating. There are also doubts regarding people’s ability to retrospectively assess severity change over time. Effects of context, framing, concurrent severity, implicit theory, and duration of recall could bias these judgments [51, 52]. Brant, Sutherland, and Hilsden have argued that simply reporting means and standard errors of interval changes corresponding to specific GRC responses is “naive,” and have proposed a formal system of regression that would take into account potential confounders as well as within-person-over-time dependencies [11]. Finally, as Norman et al. argue using both real and simulated data [15], the GRC scale may be somewhat insensitive to prospectively assessed interval changes.

Our data suggest that the most commonly used distributional and GRC anchor-based approaches yield somewhat different results, with substantive implications for RCT design. The amount of daily change that occurs during the natural resolution of a cold is less than the amount of change assessed as minimal but important by standard MID methods, or projected by GRC-based regression models. Nevertheless, we are impressed as much by the consistencies as by the discrepancies. We interpret these results as supporting the construct validity of the WURSS instrument and the utility of the GRC scale. Relationships between WURSS and GRC ratings were consistent across individuals and over time, and were in general unaffected by potential confounders. The sole anomalous finding was that the WURSS-44 interval changes corresponding to GRC ratings of “a great deal better” or “a very great deal better” were somewhat smaller than those corresponding to “moderately better” or “a good deal better.” This is perhaps due in part to the tendency of people at the end of their colds to rate daily GRC self-improvement highly, even when prospectively measured end-of-cold WURSS interval changes are small.

We are indebted to the editor and a primary reviewer of this article for asking us to include the SEM as a comparator. We were surprised and pleased to see that the SEM calculations, and the corresponding coefficients and implications for power and responsiveness, were reasonably consistent with the other methods and, in general, supportive of Wyrwich’s and Norman’s findings [16, 29, 32, 33, 53, 54]. While the sample sizes needed to detect MID, SEM, and ES may appear discrepant, the absolute amounts of between-assessment change these various methods project are surprisingly consistent, given the theoretical influences of reliability, variance, and the ability of individuals to accurately assess HRQoL over time. We are pleased that the WURSS-44 and the embedded WURSS-21 items appear to function equally well in these analyses. We expect that most users will prefer the WURSS-21, as it is shorter and easier to use.

There are, of course, several limitations to this study. Participants were volunteers, representing a limited socioeconomic and cultural profile. Our sample size was modest, hence confidence intervals are wide. All data were self-reported, and thus subject to recall and other biases. While the underlying constructs of illness severity and degree-of-change are continuous, the self-reported data came in the form of ordered categorical measures, raising questions of normality and interval equivalence. The analyses reported here are based on simple-summed WURSS scores, implying unit weighting, which may or may not be appropriate. Day-to-day changes in severity scores may not be ideal outcome measures: there are reasonably strong theoretical arguments for instead using global area-under-the-curve measures, which would combine severity and duration into a single variable and thus obviate the need for responsiveness assessment. Finally, and perhaps most importantly, there are neither known effective treatments nor independent gold standards for common cold severity assessment.

All approaches to important difference and responsiveness using HRQoL instruments are influenced by the population sampled and the conditions under which the subjects participate, as well as by the instrument itself. Responsiveness and important difference, like reliability, are properties of an instrument-in-use, not of an instrument as an isolated entity. Assessment of the amount of instrument-measured change that constitutes an important difference is value-based, and hence influenced by individual, cultural, and contextual factors. While distribution-based calculations are useful for the design of RCTs, they are insufficient for the interpretation of results. While clinician assessments and population norms can reasonably be used as reference standards, they should not be used to set benchmarks of clinical significance. Instead, it is the values of affected individuals that should be used to weigh the risks and benefits of treatments. For these reasons, we feel that the GRC-based methods highlighted here are reasonably sound, if not completely satisfactory. For an overall understanding of an instrument’s strengths and weaknesses, we suggest multiple complementary approaches. In addition to the methods reported here, direct between-person severity comparison [5, 11, 22], clinician assessments [9, 23], and benefit–harm trade-off methods [55, 56] may be useful. Depending on the data available for a given HRQoL instrument, investigators, clinicians, and patients will have to make judgments when predicting or interpreting the magnitude of treatment effects. We believe that the quality of these judgments will improve when data from several approaches are available, and when the strengths and weaknesses of specific strategies are more widely known.